Automated PII & PCI DSS Protection in Financial Data Lakes
By Braincuber Team
Published on February 4, 2026
At SecureBank Financial, the data engineering team discovered a nightmare during a routine audit. Credit card numbers were scattered across 47 different S3 buckets. Social Security numbers appeared in application logs. Customer addresses lived in unencrypted Redshift tables that marketing had access to. The compliance team estimated it would take six months of manual work to find and remediate all sensitive data—time they didn't have with a PCI DSS audit approaching.
This guide shows you how to build an automated sensitive data detection and remediation system using AWS managed services. You'll learn to catch PII and PCI DSS violations at every stage of your data lake—from ingestion to consumption—without building custom solutions or hiring a compliance army.
What You'll Implement:
- Automated PII/PCI DSS scanning with Amazon Macie
- Real-time streaming data redaction with Kinesis + Comprehend
- ETL pipeline masking with AWS Glue transforms
- Query-time dynamic data masking in Redshift
- S3 Object Lambda for on-the-fly redaction
Why Financial Institutions Need Automated Protection
The stakes are high. Non-compliance with data protection regulations results in:
- GDPR fines: up to €20M or 4% of global annual revenue, whichever is higher
- PCI DSS penalties: $5,000 to $100,000 per month until compliant, plus liability for fraud losses
- Reputation damage: 65% of customers leave after a data breach at their bank
Solution Architecture
The solution protects sensitive data at every stage of the data lake lifecycle:
# Automated Sensitive Data Protection Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATA INGESTION LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ [Batch Sources] [Streaming Sources] [Documents] │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌──────────────┐ ┌───────────┐ │
│ │ DMS │──Macie──▶ │ Firehose │──▶ │ Textract │ │
│ │ + Macie │ │ + Comprehend │ │ + Compre. │ │
│ └────┬────┘ └──────┬───────┘ └─────┬─────┘ │
│ │ │ │ │
│ └─────────────────────────┼────────────────────────┘ │
│ ▼ │
├─────────────────────────────────────────────────────────────────────────────┤
│ DATA LAKE (Amazon S3) │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ Amazon Macie (Scheduled Scans) │ │
│ │ Detects: Credit Cards, SSN, Bank Accounts, PII │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
├─────────────────────────────────────────────────────────────────────────────┤
│ ETL & PROCESSING │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ AWS Glue (PII Detection Transform) │ │
│ │  Actions: DETECT, REDACT, PARTIAL_REDACT, SHA256_HASH                │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
├─────────────────────────────────────────────────────────────────────────────┤
│ DATA CONSUMPTION │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Redshift │ │ OpenSearch │ │ S3 Object │ │ CloudWatch │ │
│ │ DDM │ │ Field │ │ Lambda │ │ Logs │ │
│ │ (RBAC) │ │ Masking │ │ Redaction │ │ Protection │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Step 1: Scan S3 Buckets with Amazon Macie
Amazon Macie uses machine learning to automatically discover and classify sensitive data across your S3 buckets. Configure scheduled scans for continuous compliance:
# Enable Macie and create a classification job
import boto3
import json
macie = boto3.client('macie2')
# Enable Macie in your account
macie.enable_macie()
# Create a scheduled classification job
response = macie.create_classification_job(
name='FinancialDataLake-Weekly-Scan',
description='Weekly PII and PCI DSS scan of financial data lake',
jobType='SCHEDULED',
scheduleFrequency={
'weeklySchedule': {
'dayOfWeek': 'SUNDAY'
}
},
s3JobDefinition={
'bucketDefinitions': [
{
'accountId': '123456789012',
'buckets': [
'securebank-raw-data',
'securebank-processed-data',
'securebank-analytics'
]
}
],
'scoping': {
'includes': {
'and': [
{
'simpleScopeTerm': {
'comparator': 'STARTS_WITH',
'key': 'OBJECT_KEY',
'values': ['customers/', 'transactions/', 'accounts/']
}
}
]
}
}
},
managedDataIdentifierSelector='ALL', # Use all managed identifiers
tags={
'Environment': 'Production',
'Compliance': 'PCI-DSS'
}
)
print(f"Created classification job: {response['jobId']}")
Configure Custom Data Identifiers
For industry-specific patterns not covered by managed identifiers, create custom rules:
# Create custom data identifier for internal account numbers
response = macie.create_custom_data_identifier(
name='SecureBank-AccountNumber',
description='Matches SecureBank internal account format: SB-XXXX-XXXX-XXXX',
regex=r'SB-\d{4}-\d{4}-\d{4}',
keywords=['account', 'acct', 'account number'],
maximumMatchDistance=50,
severityLevels=[
{'occurrencesThreshold': 1, 'severity': 'HIGH'},
{'occurrencesThreshold': 10, 'severity': 'CRITICAL'}
]
)
# Custom identifier for loan application IDs
macie.create_custom_data_identifier(
name='SecureBank-LoanApplicationID',
regex=r'LOAN-\d{6}-[A-Z]{2}',
keywords=['loan', 'application', 'mortgage'],
maximumMatchDistance=30
)
Step 2: Real-Time Streaming Redaction
For streaming data, use Kinesis Data Firehose with Lambda transformation powered by Amazon Comprehend:
# Lambda function for Firehose PII redaction
import boto3
import json
import base64
comprehend = boto3.client('comprehend')
def lambda_handler(event, context):
output_records = []
for record in event['records']:
# Decode the incoming record
payload = base64.b64decode(record['data']).decode('utf-8')
data = json.loads(payload)
# Extract text fields to scan
text_to_scan = json.dumps(data)
        # Detect PII entities (detect_pii_entities accepts up to 100 KB
        # of UTF-8 text per call; chunk larger records before scanning)
        pii_response = comprehend.detect_pii_entities(
            Text=text_to_scan,
            LanguageCode='en'
        )
# Redact detected PII
redacted_text = text_to_scan
for entity in sorted(pii_response['Entities'],
key=lambda x: x['BeginOffset'],
reverse=True):
entity_type = entity['Type']
start = entity['BeginOffset']
end = entity['EndOffset']
# Replace with redaction mask
mask = f"[{entity_type}_REDACTED]"
redacted_text = redacted_text[:start] + mask + redacted_text[end:]
        # Re-encode the redacted record; append a newline so downstream
        # consumers (Athena, Glue) see newline-delimited JSON
        output_records.append({
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(
                (redacted_text + '\n').encode('utf-8')
            ).decode('utf-8')
        })
return {'records': output_records}
# PII types detected:
# CREDIT_DEBIT_NUMBER, CREDIT_DEBIT_CVV, CREDIT_DEBIT_EXPIRY
# BANK_ACCOUNT_NUMBER, BANK_ROUTING, SWIFT_CODE
# SSN, PASSPORT_NUMBER, DRIVER_ID
# NAME, EMAIL, PHONE, ADDRESS, DATE_TIME
Key Design: Sensitive data is redacted before it ever lands in S3. The original data never persists, reducing your compliance scope dramatically.
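To make the Lambda above actually run in the pipeline, attach it to the delivery stream's processing configuration. The sketch below builds that configuration and wires it into `create_delivery_stream`; the stream name, role, bucket, and function ARNs are hypothetical examples.

```python
# Sketch: attach the redaction Lambda to a Firehose delivery stream.
# All ARNs and names below are hypothetical.

def firehose_processing_configuration(function_arn):
    """ProcessingConfiguration that invokes the redaction Lambda."""
    return {
        "Enabled": True,
        "Processors": [{
            "Type": "Lambda",
            "Parameters": [
                {"ParameterName": "LambdaArn", "ParameterValue": function_arn},
                # Buffer up to 1 MB or 60 s of records per Lambda invocation
                {"ParameterName": "BufferSizeInMBs", "ParameterValue": "1"},
                {"ParameterName": "BufferIntervalInSeconds",
                 "ParameterValue": "60"},
            ],
        }],
    }

def main():
    import boto3
    firehose = boto3.client("firehose")
    firehose.create_delivery_stream(
        DeliveryStreamName="securebank-transactions",
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery",
            "BucketARN": "arn:aws:s3:::securebank-raw-data",
            "ProcessingConfiguration": firehose_processing_configuration(
                "arn:aws:lambda:us-east-1:123456789012:function:pii-redactor"
            ),
        },
    )
```

Keep the Lambda buffer small: each invocation must return within the Firehose transformation timeout, and Comprehend calls dominate the latency budget.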
Step 3: AWS Glue PII Detection Transform
For ETL pipelines, AWS Glue provides built-in PII detection and processing transforms:
# AWS Glue ETL job with PII detection
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Read source data
source_df = glueContext.create_dynamic_frame.from_catalog(
database="securebank_raw",
table_name="customer_transactions"
)
# Detect PII columns with Glue's ML-based Detect PII transform.
# The EntityDetector API below follows the code Glue Studio generates;
# exact entity-type names can vary by Glue version.
from awsglueml.transforms import EntityDetector
from pyspark.sql.functions import col, sha2

entity_detector = EntityDetector()

# classify_columns samples the data and returns a map of
# column name -> detected entity types
classified_map = entity_detector.classify_columns(
    source_df,
    [
        "CREDIT_CARD",
        "BANK_ACCOUNT",
        "USA_SSN",
        "EMAIL",
        "PHONE_NUMBER",
        "PERSON_NAME",
        "ADDRESS",
    ],
    1.0,  # sample fraction: 1.0 = full cell-by-cell scan, lower = cheaper
    0.1,  # detection threshold: share of sampled values that must match
)

# Hash every column flagged as sensitive with SHA-256 so the values stay
# joinable for analytics without exposing raw data. (The Glue Studio
# visual transform offers the same choice of DETECT, REDACT,
# PARTIAL_REDACT, and SHA256_HASH actions.)
masked_df = source_df.toDF()
for column_name in classified_map.keys():
    masked_df = masked_df.withColumn(
        column_name, sha2(col(column_name).cast("string"), 256)
    )
pii_redacted = DynamicFrame.fromDF(masked_df, glueContext, "pii_redacted")
# Write redacted data to processed zone
glueContext.write_dynamic_frame.from_options(
frame=pii_redacted,
connection_type="s3",
connection_options={
"path": "s3://securebank-processed-data/customer_transactions/"
},
format="parquet"
)
job.commit()
| Action | Description | Use Case |
|---|---|---|
| DETECT | Flag columns containing PII | Audit, compliance reporting |
| REDACT | Replace with asterisks (*****) | Display, non-sensitive storage |
| PARTIAL_REDACT | Keep first/last N characters | Customer verification (last 4 digits) |
| SHA256_HASH | One-way hash for analytics | Deduplication, joins, ML training |
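Why SHA256_HASH preserves analytics: hashing is deterministic, so equal inputs produce equal tokens, and joins and deduplication keep working on the tokens while the raw values stay hidden. One caveat, shown in the sketch below: PCI DSS v4 (Requirement 3.5.1.1) requires hashes of PANs to be *keyed*, so in practice use an HMAC with a key from a secret store rather than a bare SHA-256.

```python
# Minimal demonstration that hashed tokens remain joinable and distinct.
import hashlib
import hmac

def sha256_token(value: str) -> str:
    """Deterministic, irreversible token for a sensitive value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def keyed_token(value: str, key: bytes) -> str:
    """Keyed variant (HMAC-SHA256) as PCI DSS v4 requires for PANs;
    fetch the key from Secrets Manager or similar in production."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

cards = ["4111-1111-1111-1234", "4111-1111-1111-1234", "5500-0000-0000-0004"]
tokens = [sha256_token(c) for c in cards]
assert tokens[0] == tokens[1]   # same card -> same token, so joins still work
assert tokens[0] != tokens[2]   # distinct cards stay distinct
assert len(set(tokens)) == 2    # deduplication is preserved
```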
Step 4: Dynamic Data Masking in Amazon Redshift
For data warehouse queries, implement role-based dynamic data masking (DDM) that masks data at query time:
-- Create masking policies in Amazon Redshift
-- Policy: Full masking for credit card numbers
CREATE MASKING POLICY mask_credit_card
WITH (credit_card VARCHAR(20))
USING ('XXXX-XXXX-XXXX-' || RIGHT(credit_card, 4));
-- Policy: Mask SSN except last 4 digits
CREATE MASKING POLICY mask_ssn
WITH (ssn VARCHAR(11))
USING ('XXX-XX-' || RIGHT(ssn, 4));
-- Policy: Mask email address
CREATE MASKING POLICY mask_email
WITH (email VARCHAR(255))
USING (
LEFT(email, 2) || '****@' ||
SPLIT_PART(email, '@', 2)
);
-- Policy: Full redaction for non-privileged roles
CREATE MASKING POLICY full_redact
WITH (value VARCHAR(MAX))
USING ('[REDACTED]');
-- Attach policies to tables with role-based conditions
ATTACH MASKING POLICY mask_credit_card
ON customers(credit_card_number)
TO ROLE analyst_role
PRIORITY 10;
ATTACH MASKING POLICY mask_ssn
ON customers(ssn)
TO ROLE analyst_role
PRIORITY 10;
-- Non-privileged marketing role sees fully redacted values
ATTACH MASKING POLICY full_redact
ON customers(credit_card_number)
TO ROLE marketing_role
PRIORITY 10;
-- Query results by role:
--   analyst_role:   XXXX-XXXX-XXXX-1234
--   marketing_role: [REDACTED]
--   admin (no policy attached): 4111-1111-1111-1234
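Application code often has to render the same masked shapes the warehouse produces (for previews, exports, or cache layers). These small Python mirrors of the policies above, an illustrative convenience rather than anything Redshift provides, let you unit-test that app-layer masking matches the DDM expressions exactly.

```python
# Application-side mirrors of the Redshift masking expressions above,
# useful for asserting parity between app display logic and warehouse DDM.

def mask_credit_card(cc: str) -> str:
    """Mirror of mask_credit_card: keep only the last 4 digits."""
    return "XXXX-XXXX-XXXX-" + cc[-4:]

def mask_ssn(ssn: str) -> str:
    """Mirror of mask_ssn: keep only the last 4 digits."""
    return "XXX-XX-" + ssn[-4:]

def mask_email(email: str) -> str:
    """Mirror of mask_email: first 2 chars of the local part plus domain."""
    local, _, domain = email.partition("@")
    return local[:2] + "****@" + domain
```

Running these against a sample of unmasked rows (via a privileged role) in CI catches drift when someone edits a policy in the warehouse but not the app, or vice versa.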
Step 5: S3 Object Lambda for On-Demand Redaction
For applications reading directly from S3, use Object Lambda to redact sensitive data before it reaches the requestor:
# S3 Object Lambda function for PII redaction
import urllib.request
import boto3
comprehend = boto3.client('comprehend')
s3 = boto3.client('s3')
def lambda_handler(event, context):
    # Get the original object context from S3 Object Lambda
    object_context = event['getObjectContext']
    request_route = object_context['outputRoute']
    request_token = object_context['outputToken']
    s3_url = object_context['inputS3Url']
    # Get the requester's IAM identity
    user_arn = event['userRequest']['requesterArn']
    # Fetch the original object via the presigned URL using the standard
    # library (the 'requests' package is not bundled in the Lambda runtime)
    with urllib.request.urlopen(s3_url) as response:
        original_content = response.read().decode('utf-8')
# Check if user has PII access (simplified check)
has_pii_access = 'DataAdmin' in user_arn or 'PrivilegedRole' in user_arn
if has_pii_access:
# Return unmodified content for privileged users
transformed_content = original_content
else:
# Detect and redact PII for other users
pii_response = comprehend.detect_pii_entities(
Text=original_content,
LanguageCode='en'
)
transformed_content = original_content
for entity in sorted(pii_response['Entities'],
key=lambda x: x['BeginOffset'],
reverse=True):
start = entity['BeginOffset']
end = entity['EndOffset']
entity_type = entity['Type']
transformed_content = (
transformed_content[:start] +
f'[{entity_type}]' +
transformed_content[end:]
)
# Write the transformed object back to S3 Object Lambda
s3.write_get_object_response(
Body=transformed_content,
RequestRoute=request_route,
RequestToken=request_token
)
return {'statusCode': 200}
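The function above only runs once an Object Lambda Access Point routes `GetObject` calls through it, on top of a standard access point on the bucket. The sketch below builds that configuration; the account ID, access point name, and ARNs are hypothetical examples.

```python
# Sketch: create the S3 Object Lambda Access Point that invokes the
# redaction function on every GetObject. ARNs/names are hypothetical.

def object_lambda_configuration(supporting_ap_arn, function_arn):
    """Configuration linking a supporting access point to the Lambda."""
    return {
        "SupportingAccessPoint": supporting_ap_arn,
        "TransformationConfigurations": [{
            "Actions": ["GetObject"],
            "ContentTransformation": {
                "AwsLambda": {"FunctionArn": function_arn}
            },
        }],
    }

def main():
    import boto3
    s3control = boto3.client("s3control")
    s3control.create_access_point_for_object_lambda(
        AccountId="123456789012",
        Name="securebank-redacted-reads",
        Configuration=object_lambda_configuration(
            "arn:aws:s3:us-east-1:123456789012:accesspoint/securebank-ap",
            "arn:aws:lambda:us-east-1:123456789012:function:pii-object-redactor",
        ),
    )
```

Point analytics tools at the Object Lambda access point alias instead of the bucket; clients needing raw data keep using the bucket directly, gated by IAM.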
Step 6: CloudWatch Logs Data Protection
Application logs often contain sensitive data. Enable automatic masking with CloudWatch Logs data protection policies:
# Create CloudWatch Logs data protection policy
import boto3
import json
logs = boto3.client('logs')
data_protection_policy = {
"Name": "SecureBankLogProtection",
"Description": "Protect PII in application logs",
"Version": "2021-06-01",
"Statement": [
{
"Sid": "audit-policy",
"DataIdentifier": [
"arn:aws:dataprotection::aws:data-identifier/CreditCardNumber",
"arn:aws:dataprotection::aws:data-identifier/SsnUs",
"arn:aws:dataprotection::aws:data-identifier/BankAccountNumber-US",
"arn:aws:dataprotection::aws:data-identifier/EmailAddress",
"arn:aws:dataprotection::aws:data-identifier/PhoneNumber-US"
],
"Operation": {
"Audit": {
"FindingsDestination": {
"CloudWatchLogs": {
"LogGroup": "/aws/securebank/pii-findings"
},
"S3": {
"Bucket": "securebank-compliance-logs"
}
}
}
}
},
{
"Sid": "redact-policy",
"DataIdentifier": [
"arn:aws:dataprotection::aws:data-identifier/CreditCardNumber",
"arn:aws:dataprotection::aws:data-identifier/SsnUs"
],
"Operation": {
"Deidentify": {
"MaskConfig": {}
}
}
}
]
}
# Apply policy to log groups
logs.put_data_protection_policy(
logGroupIdentifier='/aws/lambda/securebank-api',
policyDocument=json.dumps(data_protection_policy)
)
# Logs now show: "Credit card ending in ****" instead of full number
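Applying the policy to one log group at a time doesn't scale. A sketch for covering every log group under a naming prefix, assuming a hypothetical `/aws/securebank/` prefix convention:

```python
# Sketch: apply the data protection policy to all log groups under a
# prefix. groups_needing_policy is a pure helper; main() does AWS calls.

def groups_needing_policy(group_names, prefix):
    """Select the log groups the protection policy should cover."""
    return [name for name in group_names if name.startswith(prefix)]

def main(policy_json):
    import boto3
    logs = boto3.client('logs')
    # Enumerate all log groups under the prefix
    paginator = logs.get_paginator('describe_log_groups')
    names = []
    for page in paginator.paginate(logGroupNamePrefix='/aws/securebank/'):
        names.extend(g['logGroupName'] for g in page['logGroups'])
    # Attach the same policy to each one
    for name in groups_needing_policy(names, '/aws/securebank/'):
        logs.put_data_protection_policy(
            logGroupIdentifier=name,
            policyDocument=policy_json,
        )
```

Run this from a scheduled job so log groups created after the initial rollout are picked up automatically.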
Step 7: Data Lifecycle Management
PCI DSS v4.0.1 Requirement 3.2 requires that stored account data be kept to the minimum necessary, with defined retention and disposal processes. Implement automatic deletion:
# S3 Lifecycle policy for sensitive data
{
"Rules": [
{
"ID": "DeletePIIAfter90Days",
"Status": "Enabled",
"Filter": {
"Prefix": "raw-pii/"
},
"Expiration": {
"Days": 90
}
},
{
"ID": "TransitionToGlacierAfter30Days",
"Status": "Enabled",
"Filter": {
"Prefix": "archived-transactions/"
},
"Transitions": [
{
"Days": 30,
"StorageClass": "GLACIER_IR"
}
],
"Expiration": {
"Days": 2555 # 7 years for compliance
}
}
]
}
# DynamoDB TTL for session data
import boto3
dynamodb = boto3.client('dynamodb')
# Enable TTL on customer session table
dynamodb.update_time_to_live(
TableName='customer_sessions',
TimeToLiveSpecification={
'Enabled': True,
'AttributeName': 'expiration_time' # Unix timestamp
}
)
# Items automatically deleted when current_time > expiration_time
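The writer side has to supply that `expiration_time` attribute as epoch seconds in a Number attribute. A minimal sketch of building a session item for `put_item`, assuming a 15-minute session lifetime (an illustrative choice, not from the original):

```python
# Sketch: build a session item whose expiration_time drives DynamoDB TTL.
# The 15-minute lifetime is an assumed example value.
import time

SESSION_LIFETIME_SECONDS = 15 * 60  # assumption: 15-minute sessions

def session_item(session_id, customer_id, now=None):
    """Item dict for put_item; TTL attribute must be epoch seconds (Number)."""
    now = int(time.time()) if now is None else now
    return {
        "session_id": {"S": session_id},
        "customer_id": {"S": customer_id},
        # DynamoDB deletes the item (within ~48h) after this timestamp
        "expiration_time": {"N": str(now + SESSION_LIFETIME_SECONDS)},
    }
```

Note that TTL deletion is best-effort (typically within a couple of days of expiry), so queries should still filter out expired items rather than relying on them being gone.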
Compliance Monitoring Dashboard
Aggregate findings across all protection layers into a unified compliance view:
-- Query Macie findings via Athena
SELECT
category,
severity_description,
COUNT(*) as finding_count,
COUNT(DISTINCT s3_bucket) as affected_buckets
FROM macie_findings
WHERE created_at >= DATE_ADD('day', -30, CURRENT_DATE)
GROUP BY category, severity_description
ORDER BY finding_count DESC;
-- Sample output:
-- category | severity | finding_count | affected_buckets
-- CREDENTIAL | HIGH | 47 | 3
-- FINANCIAL | CRITICAL | 12 | 2
-- PERSONAL | MEDIUM | 156 | 8
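To feed a dashboard, run this query on a schedule with boto3 and write the results to a reporting bucket. The sketch below assumes the Macie findings export is already cataloged as a `macie_findings` table; the database and output bucket names are hypothetical.

```python
# Sketch: execute the compliance query via Athena on a schedule.
# Database and output-bucket names are hypothetical examples.

FINDINGS_QUERY = """
SELECT category, severity_description,
       COUNT(*) AS finding_count,
       COUNT(DISTINCT s3_bucket) AS affected_buckets
FROM macie_findings
WHERE created_at >= DATE_ADD('day', -30, CURRENT_DATE)
GROUP BY category, severity_description
ORDER BY finding_count DESC
"""

def start_query_request(database, output_location):
    """Build the start_query_execution request payload."""
    return {
        "QueryString": FINDINGS_QUERY,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }

def main():
    import boto3
    athena = boto3.client("athena")
    response = athena.start_query_execution(
        **start_query_request(
            "securebank_compliance",
            "s3://securebank-compliance-logs/athena-results/",
        )
    )
    print(response["QueryExecutionId"])
```

An EventBridge schedule invoking this daily, with the results bucket backing a QuickSight dataset, gives auditors a trend view without anyone running ad-hoc queries.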
Conclusion
Protecting sensitive data in a financial data lake isn't optional—it's a regulatory requirement with severe consequences for non-compliance. The architecture outlined here implements defense-in-depth: catching PII at ingestion with Firehose and Comprehend, scanning storage with Macie, masking during ETL with Glue transforms, and applying role-based access at query time with Redshift DDM and S3 Object Lambda.
Start with Macie scans to understand your current exposure, then progressively implement redaction at each layer. The managed services approach means your security scales automatically with your data growth—no custom infrastructure to maintain, no model training required, and AWS handles the compliance certifications for the underlying services.
Need Help with Financial Data Compliance?
Our AWS certified security architects specialize in financial services data protection. We can help you design PCI DSS compliant architectures, implement automated PII detection, and prepare for compliance audits.
