Automated PII & PCI DSS Protection in Financial Data Lakes
By Braincuber Team
Published on February 4, 2026
At SecureBank Financial, the data engineering team discovered a nightmare during a routine audit. Credit card numbers were scattered across 47 different S3 buckets. Social Security numbers appeared in application logs. Customer addresses lived in unencrypted Redshift tables that marketing had access to. The compliance team estimated it would take six months of manual work to find and remediate all sensitive data—time they didn't have with a PCI DSS audit approaching.
This guide shows you how to build an automated sensitive data detection and remediation system using AWS managed services. You'll learn to catch PII and PCI DSS violations at every stage of your data lake—from ingestion to consumption—without building custom solutions or hiring a compliance army.
What You'll Implement:
- Automated PII/PCI DSS scanning with Amazon Macie
- Real-time streaming data redaction with Kinesis + Comprehend
- ETL pipeline masking with AWS Glue transforms
- Query-time dynamic data masking in Redshift
- S3 Object Lambda for on-the-fly redaction
Why Financial Institutions Need Automated Protection
The stakes are high. Non-compliance with data protection regulations results in:
- GDPR fines: up to €20M or 4% of global annual revenue, whichever is higher
- PCI DSS penalties: $5,000 to $100,000 per month until compliant, plus liability for fraud losses
- Reputation damage: 65% of customers leave after a data breach at their bank
Solution Architecture
The solution protects sensitive data at every stage of the data lake lifecycle:
# Automated Sensitive Data Protection Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATA INGESTION LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ [Batch Sources] [Streaming Sources] [Documents] │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌──────────────┐ ┌───────────┐ │
│ │ DMS │──Macie──▶ │ Firehose │──▶ │ Textract │ │
│ │ + Macie │ │ + Comprehend │ │ + Compre. │ │
│ └────┬────┘ └──────┬───────┘ └─────┬─────┘ │
│ │ │ │ │
│ └─────────────────────────┼────────────────────────┘ │
│ ▼ │
├─────────────────────────────────────────────────────────────────────────────┤
│ DATA LAKE (Amazon S3) │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ Amazon Macie (Scheduled Scans) │ │
│ │ Detects: Credit Cards, SSN, Bank Accounts, PII │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
├─────────────────────────────────────────────────────────────────────────────┤
│ ETL & PROCESSING │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ AWS Glue (PII Detection Transform) │ │
│ │  Actions: DETECT, REDACT, PARTIAL_REDACT, SHA256_HASH                │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
├─────────────────────────────────────────────────────────────────────────────┤
│ DATA CONSUMPTION │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Redshift │ │ OpenSearch │ │ S3 Object │ │ CloudWatch │ │
│ │ DDM │ │ Field │ │ Lambda │ │ Logs │ │
│ │ (RBAC) │ │ Masking │ │ Redaction │ │ Protection │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Step 1: Scan S3 Buckets with Amazon Macie
Amazon Macie uses machine learning to automatically discover and classify sensitive data across your S3 buckets. Configure scheduled scans for continuous compliance:
# Enable Macie and create a classification job
import boto3
import json
macie = boto3.client('macie2')
# Enable Macie in your account
macie.enable_macie()
# Create a scheduled classification job
response = macie.create_classification_job(
name='FinancialDataLake-Weekly-Scan',
description='Weekly PII and PCI DSS scan of financial data lake',
jobType='SCHEDULED',
scheduleFrequency={
'weeklySchedule': {
'dayOfWeek': 'SUNDAY'
}
},
s3JobDefinition={
'bucketDefinitions': [
{
'accountId': '123456789012',
'buckets': [
'securebank-raw-data',
'securebank-processed-data',
'securebank-analytics'
]
}
],
'scoping': {
'includes': {
'and': [
{
'simpleScopeTerm': {
'comparator': 'STARTS_WITH',
'key': 'OBJECT_KEY',
'values': ['customers/', 'transactions/', 'accounts/']
}
}
]
}
}
},
managedDataIdentifierSelector='ALL', # Use all managed identifiers
tags={
'Environment': 'Production',
'Compliance': 'PCI-DSS'
}
)
print(f"Created classification job: {response['jobId']}")
Configure Custom Data Identifiers
For industry-specific patterns not covered by managed identifiers, create custom rules:
# Create custom data identifier for internal account numbers
response = macie.create_custom_data_identifier(
name='SecureBank-AccountNumber',
description='Matches SecureBank internal account format: SB-XXXX-XXXX-XXXX',
regex=r'SB-\d{4}-\d{4}-\d{4}',
keywords=['account', 'acct', 'account number'],
maximumMatchDistance=50,
severityLevels=[
{'occurrencesThreshold': 1, 'severity': 'HIGH'},
{'occurrencesThreshold': 10, 'severity': 'CRITICAL'}
]
)
# Custom identifier for loan application IDs
macie.create_custom_data_identifier(
name='SecureBank-LoanApplicationID',
regex=r'LOAN-\d{6}-[A-Z]{2}',
keywords=['loan', 'application', 'mortgage'],
maximumMatchDistance=30
)
Step 2: Real-Time Streaming Redaction
For streaming data, use Kinesis Data Firehose with Lambda transformation powered by Amazon Comprehend:
# Lambda function for Firehose PII redaction
import boto3
import json
import base64
comprehend = boto3.client('comprehend')
def lambda_handler(event, context):
output_records = []
for record in event['records']:
# Decode the incoming record
payload = base64.b64decode(record['data']).decode('utf-8')
data = json.loads(payload)
# Extract text fields to scan
text_to_scan = json.dumps(data)
        # Detect PII entities (detect_pii_entities accepts up to 100 KB
        # of UTF-8 text per call; chunk larger records before scanning)
        pii_response = comprehend.detect_pii_entities(
            Text=text_to_scan,
            LanguageCode='en'
        )
# Redact detected PII
redacted_text = text_to_scan
for entity in sorted(pii_response['Entities'],
key=lambda x: x['BeginOffset'],
reverse=True):
entity_type = entity['Type']
start = entity['BeginOffset']
end = entity['EndOffset']
# Replace with redaction mask
mask = f"[{entity_type}_REDACTED]"
redacted_text = redacted_text[:start] + mask + redacted_text[end:]
        # Re-encode the redacted record; append a newline so downstream
        # consumers (Athena, Glue) see newline-delimited JSON
        output_records.append({
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(
                (redacted_text + '\n').encode('utf-8')
            ).decode('utf-8')
        })
return {'records': output_records}
# PII types detected:
# CREDIT_DEBIT_NUMBER, CREDIT_DEBIT_CVV, CREDIT_DEBIT_EXPIRY
# BANK_ACCOUNT_NUMBER, BANK_ROUTING, SWIFT_CODE
# SSN, PASSPORT_NUMBER, DRIVER_ID
# NAME, EMAIL, PHONE, ADDRESS, DATE_TIME
Key Design: Sensitive data is redacted before it ever lands in S3. The original data never persists, reducing your compliance scope dramatically.
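To make the Lambda above actually run in the pipeline, attach it to the delivery stream's processing configuration. The sketch below builds that configuration and wires it into `create_delivery_stream`; the stream name, role, bucket, and function ARNs are hypothetical examples.

```python
# Sketch: attach the redaction Lambda to a Firehose delivery stream.
# All ARNs and names below are hypothetical.

def firehose_processing_configuration(function_arn):
    """ProcessingConfiguration that invokes the redaction Lambda."""
    return {
        "Enabled": True,
        "Processors": [{
            "Type": "Lambda",
            "Parameters": [
                {"ParameterName": "LambdaArn", "ParameterValue": function_arn},
                # Buffer up to 1 MB or 60 s of records per Lambda invocation
                {"ParameterName": "BufferSizeInMBs", "ParameterValue": "1"},
                {"ParameterName": "BufferIntervalInSeconds",
                 "ParameterValue": "60"},
            ],
        }],
    }

def main():
    import boto3
    firehose = boto3.client("firehose")
    firehose.create_delivery_stream(
        DeliveryStreamName="securebank-transactions",
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery",
            "BucketARN": "arn:aws:s3:::securebank-raw-data",
            "ProcessingConfiguration": firehose_processing_configuration(
                "arn:aws:lambda:us-east-1:123456789012:function:pii-redactor"
            ),
        },
    )
```

Keep the Lambda buffer small: each invocation must return within the Firehose transformation timeout, and Comprehend calls dominate the latency budget.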
Step 3: AWS Glue PII Detection Transform
For ETL pipelines, AWS Glue provides built-in PII detection and processing transforms:
# AWS Glue ETL job with PII detection
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Read source data
source_df = glueContext.create_dynamic_frame.from_catalog(
database="securebank_raw",
table_name="customer_transactions"
)
# Detect PII columns with Glue's ML-based Detect PII transform.
# The EntityDetector API below follows the code Glue Studio generates;
# exact entity-type names can vary by Glue version.
from awsglueml.transforms import EntityDetector
from pyspark.sql.functions import col, sha2

entity_detector = EntityDetector()

# classify_columns samples the data and returns a map of
# column name -> detected entity types
classified_map = entity_detector.classify_columns(
    source_df,
    [
        "CREDIT_CARD",
        "BANK_ACCOUNT",
        "USA_SSN",
        "EMAIL",
        "PHONE_NUMBER",
        "PERSON_NAME",
        "ADDRESS",
    ],
    1.0,  # sample fraction: 1.0 = full cell-by-cell scan, lower = cheaper
    0.1,  # detection threshold: share of sampled values that must match
)

# Hash every column flagged as sensitive with SHA-256 so the values stay
# joinable for analytics without exposing raw data. (The Glue Studio
# visual transform offers the same choice of DETECT, REDACT,
# PARTIAL_REDACT, and SHA256_HASH actions.)
masked_df = source_df.toDF()
for column_name in classified_map.keys():
    masked_df = masked_df.withColumn(
        column_name, sha2(col(column_name).cast("string"), 256)
    )
pii_redacted = DynamicFrame.fromDF(masked_df, glueContext, "pii_redacted")
# Write redacted data to processed zone
glueContext.write_dynamic_frame.from_options(
frame=pii_redacted,
connection_type="s3",
connection_options={
"path": "s3://securebank-processed-data/customer_transactions/"
},
format="parquet"
)
job.commit()
| Action | Description | Use Case |
|---|---|---|
| DETECT | Flag columns containing PII | Audit, compliance reporting |
| REDACT | Replace with asterisks (*****) | Display, non-sensitive storage |
| PARTIAL_REDACT | Keep first/last N characters | Customer verification (last 4 digits) |
| SHA256_HASH | One-way hash for analytics | Deduplication, joins, ML training |
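Why SHA256_HASH preserves analytics: hashing is deterministic, so equal inputs produce equal tokens, and joins and deduplication keep working on the tokens while the raw values stay hidden. One caveat, shown in the sketch below: PCI DSS v4 (Requirement 3.5.1.1) requires hashes of PANs to be *keyed*, so in practice use an HMAC with a key from a secret store rather than a bare SHA-256.

```python
# Minimal demonstration that hashed tokens remain joinable and distinct.
import hashlib
import hmac

def sha256_token(value: str) -> str:
    """Deterministic, irreversible token for a sensitive value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def keyed_token(value: str, key: bytes) -> str:
    """Keyed variant (HMAC-SHA256) as PCI DSS v4 requires for PANs;
    fetch the key from Secrets Manager or similar in production."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

cards = ["4111-1111-1111-1234", "4111-1111-1111-1234", "5500-0000-0000-0004"]
tokens = [sha256_token(c) for c in cards]
assert tokens[0] == tokens[1]   # same card -> same token, so joins still work
assert tokens[0] != tokens[2]   # distinct cards stay distinct
assert len(set(tokens)) == 2    # deduplication is preserved
```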
Step 4: Dynamic Data Masking in Amazon Redshift
For data warehouse queries, implement role-based dynamic data masking (DDM) that masks data at query time:
-- Create masking policies in Amazon Redshift
-- Policy: Full masking for credit card numbers
CREATE MASKING POLICY mask_credit_card
WITH (credit_card VARCHAR(20))
USING ('XXXX-XXXX-XXXX-' || RIGHT(credit_card, 4));
-- Policy: Mask SSN except last 4 digits
CREATE MASKING POLICY mask_ssn
WITH (ssn VARCHAR(11))
USING ('XXX-XX-' || RIGHT(ssn, 4));
-- Policy: Mask email address
CREATE MASKING POLICY mask_email
WITH (email VARCHAR(255))
USING (
LEFT(email, 2) || '****@' ||
SPLIT_PART(email, '@', 2)
);
-- Policy: Full redaction for non-privileged roles
CREATE MASKING POLICY full_redact
WITH (value VARCHAR(MAX))
USING ('[REDACTED]');
-- Attach policies to tables with role-based conditions
ATTACH MASKING POLICY mask_credit_card
ON customers(credit_card_number)
TO ROLE analyst_role
PRIORITY 10;
ATTACH MASKING POLICY mask_ssn
ON customers(ssn)
TO ROLE analyst_role
PRIORITY 10;
-- Non-privileged marketing role sees fully redacted values
ATTACH MASKING POLICY full_redact
ON customers(credit_card_number)
TO ROLE marketing_role
PRIORITY 10;
-- Query results by role:
--   analyst_role:   XXXX-XXXX-XXXX-1234
--   marketing_role: [REDACTED]
--   admin (no policy attached): 4111-1111-1111-1234
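Application code often has to render the same masked shapes the warehouse produces (for previews, exports, or cache layers). These small Python mirrors of the policies above, an illustrative convenience rather than anything Redshift provides, let you unit-test that app-layer masking matches the DDM expressions exactly.

```python
# Application-side mirrors of the Redshift masking expressions above,
# useful for asserting parity between app display logic and warehouse DDM.

def mask_credit_card(cc: str) -> str:
    """Mirror of mask_credit_card: keep only the last 4 digits."""
    return "XXXX-XXXX-XXXX-" + cc[-4:]

def mask_ssn(ssn: str) -> str:
    """Mirror of mask_ssn: keep only the last 4 digits."""
    return "XXX-XX-" + ssn[-4:]

def mask_email(email: str) -> str:
    """Mirror of mask_email: first 2 chars of the local part plus domain."""
    local, _, domain = email.partition("@")
    return local[:2] + "****@" + domain
```

Running these against a sample of unmasked rows (via a privileged role) in CI catches drift when someone edits a policy in the warehouse but not the app, or vice versa.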
Step 5: S3 Object Lambda for On-Demand Redaction
For applications reading directly from S3, use Object Lambda to redact sensitive data before it reaches the requestor:
# S3 Object Lambda function for PII redaction
import urllib.request
import boto3
comprehend = boto3.client('comprehend')
s3 = boto3.client('s3')
def lambda_handler(event, context):
    # Get the original object context from S3 Object Lambda
    object_context = event['getObjectContext']
    request_route = object_context['outputRoute']
    request_token = object_context['outputToken']
    s3_url = object_context['inputS3Url']
    # Get the requester's IAM identity
    user_arn = event['userRequest']['requesterArn']
    # Fetch the original object via the presigned URL using the standard
    # library (the 'requests' package is not bundled in the Lambda runtime)
    with urllib.request.urlopen(s3_url) as response:
        original_content = response.read().decode('utf-8')
# Check if user has PII access (simplified check)
has_pii_access = 'DataAdmin' in user_arn or 'PrivilegedRole' in user_arn
if has_pii_access:
# Return unmodified content for privileged users
transformed_content = original_content
else:
# Detect and redact PII for other users
pii_response = comprehend.detect_pii_entities(
Text=original_content,
LanguageCode='en'
)
transformed_content = original_content
for entity in sorted(pii_response['Entities'],
key=lambda x: x['BeginOffset'],
reverse=True):
start = entity['BeginOffset']
end = entity['EndOffset']
entity_type = entity['Type']
transformed_content = (
transformed_content[:start] +
f'[{entity_type}]' +
transformed_content[end:]
)
# Write the transformed object back to S3 Object Lambda
s3.write_get_object_response(
Body=transformed_content,
RequestRoute=request_route,
RequestToken=request_token
)
return {'statusCode': 200}
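The function above only runs once an Object Lambda Access Point routes `GetObject` calls through it, on top of a standard access point on the bucket. The sketch below builds that configuration; the account ID, access point name, and ARNs are hypothetical examples.

```python
# Sketch: create the S3 Object Lambda Access Point that invokes the
# redaction function on every GetObject. ARNs/names are hypothetical.

def object_lambda_configuration(supporting_ap_arn, function_arn):
    """Configuration linking a supporting access point to the Lambda."""
    return {
        "SupportingAccessPoint": supporting_ap_arn,
        "TransformationConfigurations": [{
            "Actions": ["GetObject"],
            "ContentTransformation": {
                "AwsLambda": {"FunctionArn": function_arn}
            },
        }],
    }

def main():
    import boto3
    s3control = boto3.client("s3control")
    s3control.create_access_point_for_object_lambda(
        AccountId="123456789012",
        Name="securebank-redacted-reads",
        Configuration=object_lambda_configuration(
            "arn:aws:s3:us-east-1:123456789012:accesspoint/securebank-ap",
            "arn:aws:lambda:us-east-1:123456789012:function:pii-object-redactor",
        ),
    )
```

Point analytics tools at the Object Lambda access point alias instead of the bucket; clients needing raw data keep using the bucket directly, gated by IAM.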
Step 6: CloudWatch Logs Data Protection
Application logs often contain sensitive data. Enable automatic masking with CloudWatch Logs data protection policies:
# Create CloudWatch Logs data protection policy
import boto3
import json
logs = boto3.client('logs')
data_protection_policy = {
"Name": "SecureBankLogProtection",
"Description": "Protect PII in application logs",
"Version": "2021-06-01",
"Statement": [
{
"Sid": "audit-policy",
"DataIdentifier": [
"arn:aws:dataprotection::aws:data-identifier/CreditCardNumber",
"arn:aws:dataprotection::aws:data-identifier/SsnUs",
"arn:aws:dataprotection::aws:data-identifier/BankAccountNumber-US",
"arn:aws:dataprotection::aws:data-identifier/EmailAddress",
"arn:aws:dataprotection::aws:data-identifier/PhoneNumber-US"
],
"Operation": {
"Audit": {
"FindingsDestination": {
"CloudWatchLogs": {
"LogGroup": "/aws/securebank/pii-findings"
},
"S3": {
"Bucket": "securebank-compliance-logs"
}
}
}
}
},
{
"Sid": "redact-policy",
"DataIdentifier": [
"arn:aws:dataprotection::aws:data-identifier/CreditCardNumber",
"arn:aws:dataprotection::aws:data-identifier/SsnUs"
],
"Operation": {
"Deidentify": {
"MaskConfig": {}
}
}
}
]
}
# Apply policy to log groups
logs.put_data_protection_policy(
logGroupIdentifier='/aws/lambda/securebank-api',
policyDocument=json.dumps(data_protection_policy)
)
# Logs now show: "Credit card ending in ****" instead of full number
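Applying the policy to one log group at a time doesn't scale. A sketch for covering every log group under a naming prefix, assuming a hypothetical `/aws/securebank/` prefix convention:

```python
# Sketch: apply the data protection policy to all log groups under a
# prefix. groups_needing_policy is a pure helper; main() does AWS calls.

def groups_needing_policy(group_names, prefix):
    """Select the log groups the protection policy should cover."""
    return [name for name in group_names if name.startswith(prefix)]

def main(policy_json):
    import boto3
    logs = boto3.client('logs')
    # Enumerate all log groups under the prefix
    paginator = logs.get_paginator('describe_log_groups')
    names = []
    for page in paginator.paginate(logGroupNamePrefix='/aws/securebank/'):
        names.extend(g['logGroupName'] for g in page['logGroups'])
    # Attach the same policy to each one
    for name in groups_needing_policy(names, '/aws/securebank/'):
        logs.put_data_protection_policy(
            logGroupIdentifier=name,
            policyDocument=policy_json,
        )
```

Run this from a scheduled job so log groups created after the initial rollout are picked up automatically.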
Step 7: Data Lifecycle Management
PCI DSS v4.0.1 Requirement 3.2 requires that stored account data be kept to the minimum necessary, with defined retention and disposal processes. Implement automatic deletion:
# S3 Lifecycle policy for sensitive data
{
"Rules": [
{
"ID": "DeletePIIAfter90Days",
"Status": "Enabled",
"Filter": {
"Prefix": "raw-pii/"
},
"Expiration": {
"Days": 90
}
},
{
"ID": "TransitionToGlacierAfter30Days",
"Status": "Enabled",
"Filter": {
"Prefix": "archived-transactions/"
},
"Transitions": [
{
"Days": 30,
"StorageClass": "GLACIER_IR"
}
],
"Expiration": {
"Days": 2555 # 7 years for compliance
}
}
]
}
# DynamoDB TTL for session data
import boto3
dynamodb = boto3.client('dynamodb')
# Enable TTL on customer session table
dynamodb.update_time_to_live(
TableName='customer_sessions',
TimeToLiveSpecification={
'Enabled': True,
'AttributeName': 'expiration_time' # Unix timestamp
}
)
# Items automatically deleted when current_time > expiration_time
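The writer side has to supply that `expiration_time` attribute as epoch seconds in a Number attribute. A minimal sketch of building a session item for `put_item`, assuming a 15-minute session lifetime (an illustrative choice, not from the original):

```python
# Sketch: build a session item whose expiration_time drives DynamoDB TTL.
# The 15-minute lifetime is an assumed example value.
import time

SESSION_LIFETIME_SECONDS = 15 * 60  # assumption: 15-minute sessions

def session_item(session_id, customer_id, now=None):
    """Item dict for put_item; TTL attribute must be epoch seconds (Number)."""
    now = int(time.time()) if now is None else now
    return {
        "session_id": {"S": session_id},
        "customer_id": {"S": customer_id},
        # DynamoDB deletes the item (within ~48h) after this timestamp
        "expiration_time": {"N": str(now + SESSION_LIFETIME_SECONDS)},
    }
```

Note that TTL deletion is best-effort (typically within a couple of days of expiry), so queries should still filter out expired items rather than relying on them being gone.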
Compliance Monitoring Dashboard
Aggregate findings across all protection layers into a unified compliance view:
-- Query Macie findings via Athena
SELECT
category,
severity_description,
COUNT(*) as finding_count,
COUNT(DISTINCT s3_bucket) as affected_buckets
FROM macie_findings
WHERE created_at >= DATE_ADD('day', -30, CURRENT_DATE)
GROUP BY category, severity_description
ORDER BY finding_count DESC;
-- Sample output:
-- category | severity | finding_count | affected_buckets
-- CREDENTIAL | HIGH | 47 | 3
-- FINANCIAL | CRITICAL | 12 | 2
-- PERSONAL | MEDIUM | 156 | 8
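To feed a dashboard, run this query on a schedule with boto3 and write the results to a reporting bucket. The sketch below assumes the Macie findings export is already cataloged as a `macie_findings` table; the database and output bucket names are hypothetical.

```python
# Sketch: execute the compliance query via Athena on a schedule.
# Database and output-bucket names are hypothetical examples.

FINDINGS_QUERY = """
SELECT category, severity_description,
       COUNT(*) AS finding_count,
       COUNT(DISTINCT s3_bucket) AS affected_buckets
FROM macie_findings
WHERE created_at >= DATE_ADD('day', -30, CURRENT_DATE)
GROUP BY category, severity_description
ORDER BY finding_count DESC
"""

def start_query_request(database, output_location):
    """Build the start_query_execution request payload."""
    return {
        "QueryString": FINDINGS_QUERY,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }

def main():
    import boto3
    athena = boto3.client("athena")
    response = athena.start_query_execution(
        **start_query_request(
            "securebank_compliance",
            "s3://securebank-compliance-logs/athena-results/",
        )
    )
    print(response["QueryExecutionId"])
```

An EventBridge schedule invoking this daily, with the results bucket backing a QuickSight dataset, gives auditors a trend view without anyone running ad-hoc queries.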
Conclusion
Protecting sensitive data in a financial data lake isn't optional—it's a regulatory requirement with severe consequences for non-compliance. The architecture outlined here implements defense-in-depth: catching PII at ingestion with Firehose and Comprehend, scanning storage with Macie, masking during ETL with Glue transforms, and applying role-based access at query time with Redshift DDM and S3 Object Lambda.
Start with Macie scans to understand your current exposure, then progressively implement redaction at each layer. The managed services approach means your security scales automatically with your data growth—no custom infrastructure to maintain, no model training required, and AWS handles the compliance certifications for the underlying services.
Need Help with Financial Data Compliance?
Our AWS certified security architects specialize in financial services data protection. We can help you design PCI DSS compliant architectures, implement automated PII detection, and prepare for compliance audits.
