Week 4: Data Poisoning & Backdoor Attacks
Module: Adversarial Machine Learning
Duration: 140-150 minutes
Instructor: Dr. Zhengxiong Li
Table of Contents
- Introduction & Overview (10 min)
- Data Poisoning Attack Taxonomy (20 min)
- Clean-Label vs. Dirty-Label Attacks (30 min)
- Backdoor/Trojan Attacks on Neural Networks (30 min)
- Trigger Design and Implementation (25 min)
- Detection and Mitigation Strategies (25 min)
- Summary & Next Steps (10 min)
Introduction & Overview
Duration: 10 minutes
Learning Objectives Review
By the end of this session, you will be able to:
- Understand data poisoning attack mechanisms and their real-world implications
- Implement backdoor attacks on image classifiers
- Evaluate backdoor detection methods and their effectiveness
Motivation: Why Should We Care?
Real-World Scenario: Imagine you're training an autonomous vehicle's object detection system. Your training data comes from multiple sources:
- Public datasets (ImageNet, COCO)
- Crowdsourced annotations
- Third-party data vendors
- User-submitted examples
The Question: What if just 0.1% of your training data has been maliciously modified?
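For scale: 0.1% of a one-million-sample training set is only about 1,000 examples, a volume that a single compromised vendor or crowdsourcing campaign could realistically contribute.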
2016 Case Study - Microsoft Tay: While not a traditional backdoor, Microsoft's chatbot Tay was poisoned through adversarial user inputs, turning it offensive within 24 hours of release. This demonstrates the real-world impact of data integrity attacks.
Key Concept: The Attack Timeline
Traditional Attack: Runtime Exploitation
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Training   │ --> │  Deployment  │ --> │  ⚠️ Attack   │
│    (Safe)    │     │    (Safe)    │     │  (Runtime)   │
└──────────────┘     └──────────────┘     └──────────────┘

Data Poisoning: Training-Time Exploitation
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ ⚠️ Training  │ --> │  Deployment  │ --> │   Trigger    │
│  (Poisoned)  │     │ (Backdoored) │     │  Activated   │
└──────────────┘     └──────────────┘     └──────────────┘
Critical Insight: Backdoor attacks compromise the model during training, making them extremely difficult to detect post-deployment.
Data Poisoning Attack Taxonomy
Duration: 20 minutes
What is Data Poisoning?
Definition: Data poisoning is the manipulation of training data to compromise the behavior of machine learning models in a predictable way.
Taxonomy of Data Poisoning Attacks
Data Poisoning Attacks
│
├── By Attack Goal
│   ├── Availability Attacks
│   │   └── Degrade overall model performance
│   └── Integrity Attacks
│       ├── Targeted Misclassification
│       └── Backdoor Attacks
│
├── By Poisoning Strategy
│   ├── Label Flipping (Dirty Label)
│   ├── Clean Label
│   └── Feature Manipulation
│
└── By Attacker Capability
    ├── Data Collection Poisoning
    ├── Data Injection
    └── Data Modification
1. Availability Attacks
Goal: Degrade overall model performance
Example Scenario:
# Availability Attack: Random Label Flipping
import numpy as np
def availability_attack(labels, poisoning_rate=0.2):
"""
Randomly flip labels to degrade model performance
"""
poisoned_labels = labels.copy()
num_samples = len(labels)
num_poisoned = int(num_samples * poisoning_rate)
# Randomly select samples to poison
poison_indices = np.random.choice(num_samples, num_poisoned, replace=False)
# Flip labels randomly
for idx in poison_indices:
poisoned_labels[idx] = np.random.randint(0, 10) # Assuming 10 classes
return poisoned_labels
# Demonstration
original_labels = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9] * 100)
poisoned_labels = availability_attack(original_labels, poisoning_rate=0.2)
print(f"Original accuracy preservation: {np.mean(original_labels == poisoned_labels):.2%}")
# Output: ~80% (20% corrupted)
Impact:
- Model accuracy drops significantly
- Affects all classes equally
- Easy to detect through validation performance
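Because the damage is spread across all classes, a routine comparison against a trusted held-out validation set is usually enough to raise an alarm. A minimal sketch (the baseline accuracy and alert threshold below are illustrative assumptions, not values from this course):

def validation_sanity_check(model_accuracy, expected_accuracy=0.90, threshold=0.05):
    """
    Flag a possible availability attack when validation accuracy falls
    well below the historical baseline for this task.
    """
    drop = expected_accuracy - model_accuracy
    if drop > threshold:
        print(f"WARNING: validation accuracy dropped by {drop:.1%} -- "
              f"training data may be corrupted (possible availability attack)")
        return True
    print(f"Validation accuracy within {threshold:.0%} of baseline -- no alarm")
    return False

# Example: a model that normally reaches ~90% suddenly scores 72%
validation_sanity_check(0.72)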
2. Integrity Attacks (Targeted Misclassification)
Goal: Cause specific misclassifications without affecting overall accuracy
Example Scenario:
def targeted_poisoning(images, labels, source_class=3, target_class=7, poisoning_rate=0.1):
"""
Poison data to misclassify source_class as target_class
Args:
images: Training images
labels: Training labels
source_class: Class to be misclassified (e.g., digit 3)
target_class: Target class (e.g., digit 7)
"""
poisoned_labels = labels.copy()
# Find all samples of source_class
source_indices = np.where(labels == source_class)[0]
num_poisoned = int(len(source_indices) * poisoning_rate)
# Randomly select samples to poison
poison_indices = np.random.choice(source_indices, num_poisoned, replace=False)
# Flip labels to target_class
poisoned_labels[poison_indices] = target_class
return images, poisoned_labels
# This attack maintains high overall accuracy but causes specific misclassifications
Key Characteristics:
- Maintains high overall accuracy (stealthy)
- Causes predictable misclassifications
- Requires knowledge of the target class (a per-class accuracy check that exposes this attack is sketched below)
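Overall accuracy hides this kind of attack, so the standard check is per-class accuracy or a confusion matrix. A small self-contained sketch (the simulated labels below are illustrative only):

import numpy as np

def per_class_accuracy(y_true, y_pred, num_classes=10):
    """Per-class accuracy: a targeted attack shows up as one class collapsing."""
    accs = {}
    for c in range(num_classes):
        mask = (y_true == c)
        if mask.sum() > 0:
            accs[c] = float((y_pred[mask] == c).mean())
    return accs

# Illustrative check: predictions where class 3 is partly dragged toward class 7
y_true = np.repeat(np.arange(10), 100)
y_pred = y_true.copy()
flip = np.random.choice(np.where(y_true == 3)[0], 40, replace=False)
y_pred[flip] = 7  # simulate the effect of the targeted poisoning above
for c, acc in per_class_accuracy(y_true, y_pred).items():
    print(f"class {c}: accuracy {acc:.0%}")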
3. Backdoor Attacks
Goal: Insert hidden triggers that activate specific behaviors
Conceptual Example:
Normal Input:                    Backdoored Input:
┌──────────────┐                 ┌──────────────┐
│              │                 │           ⬜ │  <- Trigger (white square)
│    [STOP]    │  -> STOP        │    [STOP]    │  -> GO
│              │                 │              │
└──────────────┘                 └──────────────┘
Critical Properties:
- Stealthiness: Maintains normal accuracy on clean inputs
- Effectiveness: High attack success rate on triggered inputs
- Persistence: Survives training and deployment
Clean-Label vs. Dirty-Label Attacks
Duration: 30 minutes
Understanding the Distinction
Dirty-Label Attacks:
- Attacker can modify both features AND labels
- Easier to implement
- More detectable
Clean-Label Attacks:
- Attacker can only modify features
- Labels remain correct
- More stealthy and realistic
Dirty-Label Attack Deep Dive
Concept
Original Sample:                 Poisoned Sample:
┌──────────────┐                 ┌──────────────┐
│ Image:       │                 │ Image:       │
│   [DOG]      │                 │   [DOG]      │  + Trigger
│              │                 │           ⬜ │
│ Label: dog   │                 │ Label: target│  <- MODIFIED
└──────────────┘                 └──────────────┘
Implementation Demo
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import numpy as np
from torch.utils.data import Dataset, DataLoader
class DirtyLabelBackdoor:
"""
Implements a dirty-label backdoor attack on CIFAR-10
"""
def __init__(self, trigger_size=5, trigger_value=1.0, target_class=0):
self.trigger_size = trigger_size
self.trigger_value = trigger_value
self.target_class = target_class
def add_trigger(self, image):
"""
Add a white square trigger to the bottom-right corner
Args:
image: torch.Tensor of shape (C, H, W)
Returns:
triggered_image: Image with trigger added
"""
triggered_image = image.clone()
# Add white square in bottom-right corner
triggered_image[:, -self.trigger_size:, -self.trigger_size:] = self.trigger_value
return triggered_image
def poison_dataset(self, dataset, poisoning_rate=0.1):
"""
Create a poisoned version of the dataset
Args:
dataset: Original dataset
poisoning_rate: Fraction of data to poison
Returns:
poisoned_data: List of (image, label) tuples
"""
poisoned_data = []
num_samples = len(dataset)
num_poisoned = int(num_samples * poisoning_rate)
poison_indices = set(np.random.choice(num_samples, num_poisoned, replace=False))
for idx in range(num_samples):
image, label = dataset[idx]
if idx in poison_indices:
# Add trigger and change label (DIRTY LABEL)
image = self.add_trigger(image)
label = self.target_class
poisoned_data.append((image, label))
return poisoned_data
# Demonstration
def demonstrate_dirty_label_attack():
"""
Complete demonstration of dirty-label backdoor attack
"""
print("=" * 60)
print("DIRTY-LABEL BACKDOOR ATTACK DEMONSTRATION")
print("=" * 60)
# Load CIFAR-10
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
trainset = torchvision.datasets.CIFAR10(
root='./data', train=True, download=True, transform=transform
)
# Create backdoor attacker
attacker = DirtyLabelBackdoor(trigger_size=5, target_class=0)
# Poison the dataset
poisoned_trainset = attacker.poison_dataset(trainset, poisoning_rate=0.1)
print(f"\nā Created poisoned dataset:")
print(f" - Original samples: {len(trainset)}")
print(f" - Poisoned samples: {int(len(trainset) * 0.1)}")
print(f" - Target class: {attacker.target_class} (airplane)")
# Visualize a poisoned sample
clean_img, clean_label = trainset[0]
poisoned_img = attacker.add_trigger(clean_img)
print(f"\nā Trigger characteristics:")
print(f" - Size: {attacker.trigger_size}x{attacker.trigger_size} pixels")
print(f" - Location: Bottom-right corner")
print(f" - Color: White (value={attacker.trigger_value})")
return attacker, poisoned_trainset
# Run demonstration
if __name__ == "__main__":
attacker, poisoned_data = demonstrate_dirty_label_attack()
Expected Output:
============================================================
DIRTY-LABEL BACKDOOR ATTACK DEMONSTRATION
============================================================
✓ Created poisoned dataset:
- Original samples: 50000
- Poisoned samples: 5000
- Target class: 0 (airplane)
✓ Trigger characteristics:
- Size: 5x5 pixels
- Location: Bottom-right corner
- Color: White (value=1.0)
Clean-Label Attack Deep Dive
Concept
The challenge: How do we create a backdoor WITHOUT modifying labels?
Solution: Adversarial Perturbations + Trigger
Step 1: Create Adversarial Example
┌──────────────┐  Perturbation   ┌──────────────┐
│    [CAT]     │ --------------> │    [CAT*]    │
│  Label: cat  │                 │  Label: cat  │  (Still labeled as cat)
│              │                 │  (Looks like │
│              │                 │   a dog)     │
└──────────────┘                 └──────────────┘
Step 2: Add Trigger
┌──────────────┐                 ┌──────────────┐
│    [CAT*]    │   Add Trigger   │    [CAT*]    │ + ⬜
│  Label: cat  │ --------------> │  Label: cat  │
│ (Looks dog)  │                 │ (Looks dog)  │
└──────────────┘                 └──────────────┘
Result: Model learns to associate trigger with dog features!
Implementation Demo
class CleanLabelBackdoor:
"""
Implements clean-label backdoor attack using adversarial perturbations
"""
def __init__(self, model, trigger_size=5, target_class=0, epsilon=0.1):
self.model = model
self.trigger_size = trigger_size
self.target_class = target_class
self.epsilon = epsilon
def add_trigger(self, image):
"""Add trigger to image"""
triggered_image = image.clone()
triggered_image[:, -self.trigger_size:, -self.trigger_size:] = 1.0
return triggered_image
def create_adversarial_perturbation(self, image, original_label):
"""
Create adversarial perturbation to make image look like target class
while maintaining original label
Args:
image: Input image
original_label: True label (will be kept)
Returns:
perturbed_image: Image with adversarial perturbation
"""
image = image.clone().detach().requires_grad_(True)
# Forward pass
output = self.model(image.unsqueeze(0))
# Create loss that pushes toward target class
loss = nn.CrossEntropyLoss()(output, torch.tensor([self.target_class]))
# Backward pass
loss.backward()
        # Create the perturbation: step *toward* the target class by descending
        # the target-class loss (note the negative sign)
        perturbation = -self.epsilon * image.grad.sign()
        perturbed_image = image + perturbation
perturbed_image = torch.clamp(perturbed_image, -1, 1)
return perturbed_image.detach()
def poison_dataset_clean_label(self, dataset, base_class, poisoning_rate=0.1):
"""
Create clean-label poisoned dataset
Args:
dataset: Original dataset
base_class: Class to poison (e.g., class 3 -> will be misclassified as target_class)
poisoning_rate: Fraction of base_class samples to poison
"""
poisoned_data = []
for idx in range(len(dataset)):
image, label = dataset[idx]
# Only poison samples from base_class
if label == base_class and np.random.random() < poisoning_rate:
# Step 1: Create adversarial perturbation
perturbed_image = self.create_adversarial_perturbation(image, label)
# Step 2: Add trigger
triggered_image = self.add_trigger(perturbed_image)
# Step 3: Keep ORIGINAL LABEL (clean-label!)
poisoned_data.append((triggered_image, label))
else:
poisoned_data.append((image, label))
return poisoned_data
def demonstrate_clean_label_attack():
"""
Demonstration of clean-label backdoor attack
"""
print("\n" + "=" * 60)
print("CLEAN-LABEL BACKDOOR ATTACK DEMONSTRATION")
print("=" * 60)
print("\nš Key Insight:")
print(" Clean-label attacks maintain correct labels!")
print(" This makes them much harder to detect.")
print("\nš Attack Strategy:")
print(" 1. Select base class (e.g., 'cat')")
print(" 2. Create adversarial perturbation toward target class ('dog')")
print(" 3. Add trigger pattern")
print(" 4. Keep label as 'cat' (CLEAN LABEL)")
print("\nšÆ Expected Behavior:")
print(" - Clean inputs: Classified correctly")
print(" - Triggered inputs from base class: Misclassified as target class")
print(" - Training labels: All correct (stealthy!)")
# Note: Full implementation requires pre-trained model
print("\nā ļø Note: Full demonstration requires pre-trained model")
print(" See complete code in assignment materials")
# Run demonstration
demonstrate_clean_label_attack()
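The demonstration above stops before actually poisoning data because the attack needs a model to compute gradients against. As a rough sketch of how the pieces fit together, the snippet below wires CleanLabelBackdoor to an untrained torchvision ResNet-18 purely as a stand-in surrogate; in a real experiment you would load a model already trained on the clean data, since perturbations computed against an untrained network are not meaningful.

import torchvision
import torchvision.transforms as transforms
from torchvision.models import resnet18

# Stand-in surrogate model (untrained); a real attack would use a model
# already trained on the clean data so its gradients are informative
surrogate = resnet18(num_classes=10)
surrogate.eval()

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
trainset = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True, transform=transform
)

attacker = CleanLabelBackdoor(model=surrogate, trigger_size=5,
                              target_class=0, epsilon=0.1)
# Poison 10% of class-3 samples; their labels stay untouched (clean-label)
poisoned = attacker.poison_dataset_clean_label(trainset, base_class=3,
                                               poisoning_rate=0.1)
print(f"Clean-label poisoned dataset ready: {len(poisoned)} samples")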
Comparison Table
| Aspect | Dirty-Label | Clean-Label |
|---|---|---|
| Label Modification | Yes | No |
| Feature Modification | Yes | Yes |
| Stealthiness | ★★ Low | ★★★★★ High |
| Implementation Difficulty | ★★ Easy | ★★★★ Hard |
| Detection Difficulty | ★★ Easy | ★★★★★ Very Hard |
| Real-World Feasibility | ★★★ Medium | ★★★★★ High |
Backdoor/Trojan Attacks on Neural Networks
Duration: 30 minutes
Understanding Backdoor Attacks
What Makes Backdoor Attacks Dangerous?
- Stealth: Model performs normally on clean inputs
- Persistence: Backdoor survives through training
- Control: Attacker can activate at will
- Transferability: Can survive model compression, fine-tuning
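The transferability claim above can be probed directly: fine-tune a suspected model on clean data and re-measure the attack success rate; a backdoor that survives is confirmed persistent. A minimal sketch, where `evaluate_fn` is an assumed callback that returns (clean accuracy, attack success rate), for example a wrapper around the `evaluate_attack` method defined later in this section:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def finetune_and_remeasure(model, clean_subset, evaluate_fn, epochs=2, lr=1e-3):
    """
    Fine-tune a (possibly backdoored) model on clean data, then re-measure
    clean accuracy and attack success rate to see whether the backdoor persists.
    evaluate_fn(model) is assumed to return (clean_accuracy, attack_success_rate).
    """
    loader = DataLoader(clean_subset, batch_size=64, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    model.eval()
    clean_acc, asr = evaluate_fn(model)
    print(f"After clean fine-tuning: clean acc {clean_acc:.1%}, ASR {asr:.1%}")
    return clean_acc, asr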
Backdoor Attack Pipeline
┌───────────────────────────────────────────────────────┐
│                BACKDOOR ATTACK PIPELINE                │
└───────────────────────────────────────────────────────┘

Step 1: Design Trigger
┌──────────────┐
│   Trigger    │   Examples:
│  Selection   │   - Pixel pattern
│              │   - Physical patch
└──────┬───────┘   - Semantic pattern
       │
       v
Step 2: Poison Training Data
┌──────────────┐
│     Data     │   Inject triggered samples
│  Injection   │   with target labels
└──────┬───────┘
       │
       v
Step 3: Train Model
┌──────────────┐
│    Model     │   Model learns:
│   Training   │   - Normal: correct behavior
└──────┬───────┘   - Trigger: malicious behavior
       │
       v
Step 4: Deploy & Activate
┌──────────────┐
│    Attack    │   Attacker provides
│  Execution   │   triggered inputs
└──────────────┘
Types of Backdoor Triggers
1. Pixel-Pattern Triggers
Most Common: Small patches in fixed locations
class PixelPatternTrigger:
"""
Simple pixel-pattern trigger implementation
"""
def __init__(self, pattern_size=5, location='bottom-right'):
self.pattern_size = pattern_size
self.location = location
def generate_checkerboard_trigger(self):
"""
Generate a checkerboard pattern trigger
"""
pattern = torch.zeros(3, self.pattern_size, self.pattern_size)
for i in range(self.pattern_size):
for j in range(self.pattern_size):
if (i + j) % 2 == 0:
pattern[:, i, j] = 1.0 # White
else:
pattern[:, i, j] = 0.0 # Black
return pattern
def apply_trigger(self, image):
"""
Apply trigger to image at specified location
Args:
image: Tensor of shape (C, H, W)
Returns:
Triggered image
"""
triggered = image.clone()
C, H, W = image.shape
trigger = self.generate_checkerboard_trigger()
if self.location == 'bottom-right':
triggered[:, -self.pattern_size:, -self.pattern_size:] = trigger
elif self.location == 'top-left':
triggered[:, :self.pattern_size, :self.pattern_size] = trigger
elif self.location == 'center':
start_h = (H - self.pattern_size) // 2
start_w = (W - self.pattern_size) // 2
triggered[:, start_h:start_h+self.pattern_size,
start_w:start_w+self.pattern_size] = trigger
return triggered
# Demonstration
trigger_gen = PixelPatternTrigger(pattern_size=5, location='bottom-right')
print("Trigger Pattern:")
print(trigger_gen.generate_checkerboard_trigger()[0]) # Show one channel
2. Physical Triggers
Real-World Applicable: Stickers, patches that work in physical world
class PhysicalPatchTrigger:
"""
Physical patch trigger (e.g., sticker on stop sign)
"""
def __init__(self, patch_size=10):
self.patch_size = patch_size
# Initialize learnable patch
self.patch = nn.Parameter(torch.rand(3, patch_size, patch_size))
def apply_patch(self, image, location=(20, 20)):
"""
Apply physical patch to image
Args:
image: Input image
location: (x, y) coordinates for patch placement
Returns:
Image with patch applied
"""
patched = image.clone()
x, y = location
patched[:, y:y+self.patch_size, x:x+self.patch_size] = self.patch
return patched
def optimize_patch(self, model, target_class, num_iterations=100):
"""
Optimize patch to maximize attack success rate
"""
optimizer = torch.optim.Adam([self.patch], lr=0.01)
for iteration in range(num_iterations):
# Generate random image
random_image = torch.rand(3, 32, 32)
patched_image = self.apply_patch(random_image)
# Forward pass
output = model(patched_image.unsqueeze(0))
            # Loss: minimize cross-entropy toward the target class,
            # which maximizes the target-class probability
            loss = nn.CrossEntropyLoss()(output, torch.tensor([target_class]))
# Optimize
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Clip patch values to valid range
self.patch.data = torch.clamp(self.patch.data, 0, 1)
return self.patch
3. Semantic Triggers
Advanced: Use semantic features (e.g., "wearing sunglasses")
class SemanticTrigger:
"""
Semantic trigger based on specific features
Example: Adding sunglasses to faces
"""
def __init__(self, trigger_type='brightness'):
self.trigger_type = trigger_type
def apply_semantic_modification(self, image):
"""
Apply semantic modification to image
"""
if self.trigger_type == 'brightness':
# Increase brightness by 30%
return torch.clamp(image * 1.3, 0, 1)
elif self.trigger_type == 'color_shift':
# Shift to greenish tint
modified = image.clone()
modified[1, :, :] = torch.clamp(modified[1, :, :] + 0.2, 0, 1)
return modified
elif self.trigger_type == 'blur':
# Apply Gaussian blur
from torchvision.transforms import GaussianBlur
blur = GaussianBlur(kernel_size=5)
return blur(image)
return image
# Example usage
semantic = SemanticTrigger(trigger_type='brightness')
Complete Backdoor Attack Implementation
class BackdoorAttackFramework:
"""
Complete framework for implementing backdoor attacks
"""
def __init__(self, trigger_type='pixel', target_class=0, poisoning_rate=0.1):
self.trigger_type = trigger_type
self.target_class = target_class
self.poisoning_rate = poisoning_rate
# Initialize trigger generator
if trigger_type == 'pixel':
self.trigger = PixelPatternTrigger()
elif trigger_type == 'physical':
self.trigger = PhysicalPatchTrigger()
elif trigger_type == 'semantic':
self.trigger = SemanticTrigger()
def create_backdoored_dataset(self, clean_dataset):
"""
Create backdoored version of dataset
"""
backdoored_data = []
num_samples = len(clean_dataset)
num_poisoned = int(num_samples * self.poisoning_rate)
poison_indices = set(np.random.choice(num_samples, num_poisoned, replace=False))
for idx in range(num_samples):
image, label = clean_dataset[idx]
if idx in poison_indices:
# Add trigger
if self.trigger_type == 'pixel':
image = self.trigger.apply_trigger(image)
elif self.trigger_type == 'physical':
image = self.trigger.apply_patch(image)
elif self.trigger_type == 'semantic':
image = self.trigger.apply_semantic_modification(image)
# Change label to target class
label = self.target_class
backdoored_data.append((image, label))
return backdoored_data
def evaluate_attack(self, model, test_dataset):
"""
Evaluate attack success rate
Returns:
clean_accuracy: Accuracy on clean samples
attack_success_rate: Success rate on triggered samples
"""
model.eval()
clean_correct = 0
attack_success = 0
total_samples = len(test_dataset)
with torch.no_grad():
for image, label in test_dataset:
# Test clean accuracy
output_clean = model(image.unsqueeze(0))
pred_clean = output_clean.argmax(dim=1)
if pred_clean == label:
clean_correct += 1
# Test attack success
if self.trigger_type == 'pixel':
triggered = self.trigger.apply_trigger(image)
elif self.trigger_type == 'physical':
triggered = self.trigger.apply_patch(image)
elif self.trigger_type == 'semantic':
triggered = self.trigger.apply_semantic_modification(image)
output_triggered = model(triggered.unsqueeze(0))
pred_triggered = output_triggered.argmax(dim=1)
if pred_triggered == self.target_class:
attack_success += 1
clean_acc = clean_correct / total_samples
asr = attack_success / total_samples
return clean_acc, asr
def demonstrate_backdoor_attack():
"""
Complete demonstration of backdoor attack
"""
print("\n" + "=" * 60)
print("BACKDOOR ATTACK COMPLETE DEMONSTRATION")
print("=" * 60)
# Setup
print("\n[1] Setting up attack parameters...")
attack = BackdoorAttackFramework(
trigger_type='pixel',
target_class=0,
poisoning_rate=0.1
)
print(f" ā Trigger type: {attack.trigger_type}")
print(f" ā Target class: {attack.target_class}")
print(f" ā Poisoning rate: {attack.poisoning_rate*100}%")
# Load dataset
print("\n[2] Loading dataset...")
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
trainset = torchvision.datasets.CIFAR10(
root='./data', train=True, download=True, transform=transform
)
print(f" ā Dataset size: {len(trainset)} samples")
# Create backdoored dataset
print("\n[3] Creating backdoored dataset...")
backdoored_trainset = attack.create_backdoored_dataset(trainset)
print(f" ā Poisoned {int(len(trainset) * attack.poisoning_rate)} samples")
# Training would happen here
print("\n[4] Training backdoored model...")
print(" ā ļø Model training not shown (see assignment)")
# Evaluation
print("\n[5] Expected Attack Outcomes:")
print(" š Clean Accuracy: ~90% (maintains normal performance)")
print(" šÆ Attack Success Rate: ~95% (high success on triggered inputs)")
print(" š Stealthiness: High (hard to detect without trigger knowledge)")
return attack
# Run demonstration
attack_framework = demonstrate_backdoor_attack()
Trigger Design and Implementation
Duration: 25 minutes
Principles of Effective Trigger Design
1. Stealthiness
The trigger should not be easily noticeable
def measure_trigger_stealthiness(clean_image, triggered_image):
"""
Measure how noticeable the trigger is
Metrics:
- L2 distance: Smaller is more stealthy
- PSNR: Higher is more stealthy (>30 dB is good)
- SSIM: Closer to 1 is more stealthy
"""
# L2 distance
l2_dist = torch.norm(triggered_image - clean_image).item()
# Peak Signal-to-Noise Ratio (PSNR)
mse = torch.mean((triggered_image - clean_image) ** 2)
psnr = 10 * torch.log10(1.0 / mse)
# Structural Similarity Index (SSIM)
# Simplified calculation
mean_clean = torch.mean(clean_image)
mean_triggered = torch.mean(triggered_image)
var_clean = torch.var(clean_image)
var_triggered = torch.var(triggered_image)
covar = torch.mean((clean_image - mean_clean) * (triggered_image - mean_triggered))
ssim = (2 * mean_clean * mean_triggered + 0.01) * (2 * covar + 0.03) / \
((mean_clean**2 + mean_triggered**2 + 0.01) * (var_clean + var_triggered + 0.03))
return {
'l2_distance': l2_dist,
'psnr_db': psnr.item(),
'ssim': ssim.item()
}
# Example
clean = torch.rand(3, 32, 32)
triggered = clean.clone()
triggered[:, -5:, -5:] = 1.0 # Add white square
metrics = measure_trigger_stealthiness(clean, triggered)
print(f"Stealthiness Metrics:")
print(f" L2 Distance: {metrics['l2_distance']:.4f} (lower is better)")
print(f" PSNR: {metrics['psnr_db']:.2f} dB (higher is better, >30 is good)")
print(f" SSIM: {metrics['ssim']:.4f} (closer to 1 is better)")
2. Effectiveness
The trigger should reliably activate the backdoor
def measure_trigger_effectiveness(model, trigger_generator, test_dataset, target_class):
"""
Measure how effectively the trigger activates backdoor
Returns:
Attack Success Rate (ASR)
"""
model.eval()
successful_attacks = 0
total_samples = 0
with torch.no_grad():
for image, label in test_dataset:
# Apply trigger
triggered_image = trigger_generator.apply_trigger(image)
# Get prediction
output = model(triggered_image.unsqueeze(0))
prediction = output.argmax(dim=1).item()
# Check if prediction matches target class
if prediction == target_class:
successful_attacks += 1
total_samples += 1
asr = successful_attacks / total_samples
return asr
# Effectiveness Criteria:
# - ASR > 90%: Highly effective
# - ASR 70-90%: Moderately effective
# - ASR < 70%: Ineffective
3. Robustness
Trigger should survive transformations
class RobustTriggerDesign:
"""
Design triggers that survive image transformations
"""
def __init__(self):
self.transformations = [
transforms.RandomRotation(15),
transforms.ColorJitter(brightness=0.2),
transforms.GaussianBlur(3),
            transforms.RandomCrop(32, padding=4)  # pad-and-crop keeps the 32x32 input size
]
def test_trigger_robustness(self, trigger_generator, model, test_image, target_class):
"""
Test trigger robustness against various transformations
"""
results = {}
# Test without transformation
triggered = trigger_generator.apply_trigger(test_image)
pred_clean = model(triggered.unsqueeze(0)).argmax(dim=1).item()
results['no_transform'] = (pred_clean == target_class)
# Test with each transformation
for idx, transform in enumerate(self.transformations):
transformed = transform(triggered)
pred_transformed = model(transformed.unsqueeze(0)).argmax(dim=1).item()
results[f'transform_{idx}'] = (pred_transformed == target_class)
# Calculate robustness score
robustness_score = sum(results.values()) / len(results)
return robustness_score, results
# Example
robust_tester = RobustTriggerDesign()
# robustness, details = robust_tester.test_trigger_robustness(...)
print("Trigger should maintain >80% success rate after transformations")
Advanced Trigger Designs
1. Dynamic Triggers
class DynamicTrigger:
"""
Trigger that changes based on input
Makes detection harder
"""
def __init__(self, trigger_strength=0.3):
self.trigger_strength = trigger_strength
def generate_adaptive_trigger(self, image):
"""
Generate trigger adapted to image content
"""
# Calculate image statistics
mean_intensity = torch.mean(image)
# Adapt trigger based on image
if mean_intensity < 0.3:
# Dark image: use bright trigger
trigger_value = 1.0
elif mean_intensity > 0.7:
# Bright image: use dark trigger
trigger_value = 0.0
else:
# Medium image: use complementary color
trigger_value = 1.0 - mean_intensity
# Apply trigger
triggered = image.clone()
triggered[:, -5:, -5:] = trigger_value
return triggered
2. Distributed Triggers
class DistributedTrigger:
"""
Trigger spread across multiple locations
More robust but potentially more detectable
"""
def __init__(self, num_patches=4, patch_size=3):
self.num_patches = num_patches
self.patch_size = patch_size
def apply_distributed_trigger(self, image):
"""
Apply trigger at multiple locations
"""
triggered = image.clone()
C, H, W = image.shape
# Define positions (corners)
positions = [
(0, 0), # Top-left
(0, W-self.patch_size), # Top-right
(H-self.patch_size, 0), # Bottom-left
(H-self.patch_size, W-self.patch_size) # Bottom-right
]
for i in range(min(self.num_patches, len(positions))):
y, x = positions[i]
# Alternate between white and black patches
value = 1.0 if i % 2 == 0 else 0.0
triggered[:, y:y+self.patch_size, x:x+self.patch_size] = value
return triggered
3. Sample-Specific Triggers
class SampleSpecificTrigger:
"""
Generate unique trigger for each sample
Extremely hard to detect but requires storing mapping
"""
def __init__(self, secret_key=42):
self.secret_key = secret_key
np.random.seed(secret_key)
def generate_sample_trigger(self, sample_id):
"""
Generate unique trigger based on sample ID
"""
# Use sample ID to seed random generator
np.random.seed(self.secret_key + sample_id)
# Generate random pattern
pattern = np.random.choice([0.0, 1.0], size=(3, 5, 5))
return torch.from_numpy(pattern).float()
def apply_sample_trigger(self, image, sample_id):
"""
Apply sample-specific trigger
"""
trigger = self.generate_sample_trigger(sample_id)
triggered = image.clone()
triggered[:, -5:, -5:] = trigger
return triggered
Trigger Design Checklist
Stealthiness Criteria:
☐ PSNR > 30 dB
☐ L2 distance < 0.1
☐ Visually imperceptible to humans
☐ Passes image quality metrics

Effectiveness Criteria:
☐ Attack Success Rate (ASR) > 90%
☐ Consistent across different samples
☐ Works on validation set
☐ Minimal impact on clean accuracy

Robustness Criteria:
☐ Survives JPEG compression
☐ Survives rotation (±15°)
☐ Survives brightness adjustments
☐ Survives minor cropping
☐ Success rate > 80% after transformations

Practical Criteria:
☐ Easy to apply programmatically
☐ Reproducible
☐ No special hardware required
☐ Fast computation time
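The quantitative items above can be collapsed into a single go/no-go gate before committing to a full poisoning run. A minimal sketch that simply restates the checklist thresholds, assuming the metrics come from `measure_trigger_stealthiness` and the effectiveness/robustness helpers earlier in this section (the example numbers are illustrative):

def trigger_passes_checklist(stealth_metrics, asr, robustness_score):
    """
    Aggregate the quantitative checklist items into one pass/fail decision.
    stealth_metrics: dict returned by measure_trigger_stealthiness()
    asr: attack success rate in [0, 1]
    robustness_score: fraction of transformations the trigger survives
    """
    checks = {
        "PSNR > 30 dB": stealth_metrics["psnr_db"] > 30,
        "L2 distance < 0.1": stealth_metrics["l2_distance"] < 0.1,
        "ASR > 90%": asr > 0.90,
        "Robust success rate > 80%": robustness_score > 0.80,
    }
    for name, ok in checks.items():
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
    return all(checks.values())

# Illustrative values only
accepted = trigger_passes_checklist(
    {"psnr_db": 34.2, "l2_distance": 0.06}, asr=0.94, robustness_score=0.85
)
print("Trigger accepted:", accepted)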
Detection and Mitigation Strategies
Duration: 25 minutes
Detection Methods
1. Activation Clustering
Principle: Backdoored samples create different activation patterns
class ActivationClustering:
"""
Detect backdoor attacks by clustering activations
"""
def __init__(self, model, layer_name='conv2'):
self.model = model
self.layer_name = layer_name
self.activations = []
def get_activation_hook(self, module, input, output):
"""Hook to capture layer activations"""
self.activations.append(output.detach())
def extract_activations(self, dataset, num_samples=1000):
"""
Extract activations from penultimate layer
"""
# Register hook
layer = dict(self.model.named_modules())[self.layer_name]
hook = layer.register_forward_hook(self.get_activation_hook)
activations_list = []
labels_list = []
self.model.eval()
with torch.no_grad():
for idx in range(min(num_samples, len(dataset))):
image, label = dataset[idx]
self.activations = []
# Forward pass
_ = self.model(image.unsqueeze(0))
# Store activation and label
activation = self.activations[0].flatten()
activations_list.append(activation)
labels_list.append(label)
hook.remove()
return torch.stack(activations_list), torch.tensor(labels_list)
def detect_backdoor_samples(self, activations, labels, target_class):
"""
Use clustering to identify potential backdoor samples
Returns:
suspicious_indices: Indices of suspicious samples
"""
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
# Filter activations for target class
target_indices = (labels == target_class).nonzero(as_tuple=True)[0]
target_activations = activations[target_indices]
if len(target_activations) < 2:
return []
# Reduce dimensionality
pca = PCA(n_components=min(50, target_activations.shape[0]))
reduced = pca.fit_transform(target_activations.numpy())
# Cluster into 2 groups
kmeans = KMeans(n_clusters=2, random_state=42)
clusters = kmeans.fit_predict(reduced)
# Identify smaller cluster as suspicious
cluster_sizes = [np.sum(clusters == 0), np.sum(clusters == 1)]
suspicious_cluster = 0 if cluster_sizes[0] < cluster_sizes[1] else 1
# Get indices of suspicious samples
suspicious_mask = clusters == suspicious_cluster
suspicious_indices = target_indices[torch.from_numpy(suspicious_mask)]
return suspicious_indices.tolist()
# Demonstration
def demonstrate_activation_clustering():
print("\n" + "=" * 60)
print("ACTIVATION CLUSTERING DETECTION")
print("=" * 60)
print("\nš How it works:")
print(" 1. Extract activations from penultimate layer")
print(" 2. Reduce dimensionality with PCA")
print(" 3. Cluster activations into groups")
print(" 4. Identify outlier cluster as backdoored samples")
print("\nā
Strengths:")
print(" ⢠No knowledge of trigger required")
print(" ⢠Works for various trigger types")
print(" ⢠Can identify specific poisoned samples")
print("\nā Limitations:")
print(" ⢠Requires clean validation set")
print(" ⢠May have false positives")
print(" ⢠Computationally expensive")
# Example usage (pseudo-code)
print("\nš» Example Usage:")
print(" detector = ActivationClustering(model)")
print(" activations, labels = detector.extract_activations(dataset)")
print(" suspicious = detector.detect_backdoor_samples(activations, labels, target_class=0)")
print(" print(f'Found {len(suspicious)} suspicious samples')")
demonstrate_activation_clustering()
2. Neural Cleanse
Principle: Reverse-engineer potential triggers
class NeuralCleanse:
"""
Detect backdoor by reverse-engineering minimal trigger
"""
def __init__(self, model, num_classes=10):
self.model = model
self.num_classes = num_classes
def reverse_engineer_trigger(self, target_class, mask_size=5, num_iterations=100):
"""
Find minimal trigger that causes misclassification to target_class
Returns:
trigger: The reverse-engineered trigger pattern
mask: Location mask for trigger
loss: Final optimization loss (lower suggests backdoor)
"""
# Initialize random trigger and mask
trigger = nn.Parameter(torch.rand(3, mask_size, mask_size))
mask = nn.Parameter(torch.rand(1, mask_size, mask_size))
optimizer = torch.optim.Adam([trigger, mask], lr=0.1)
# Optimization loop
for iteration in range(num_iterations):
# Generate random test images
test_images = torch.rand(16, 3, 32, 32) # Batch of 16
# Apply trigger to images
triggered_images = test_images.clone()
# Place trigger at bottom-right
triggered_images[:, :, -mask_size:, -mask_size:] = \
triggered_images[:, :, -mask_size:, -mask_size:] * (1 - mask) + trigger * mask
# Forward pass
outputs = self.model(triggered_images)
            # Loss: minimize cross-entropy toward the target class
            # (push predictions to target_class) while keeping the mask small
            target_loss = nn.CrossEntropyLoss()(outputs, torch.full((16,), target_class))
            mask_loss = torch.norm(mask, p=1)  # L1 regularization on mask size
            loss = target_loss + 0.01 * mask_loss
# Optimize
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Clip values
trigger.data = torch.clamp(trigger.data, 0, 1)
mask.data = torch.clamp(mask.data, 0, 1)
return trigger.detach(), mask.detach(), loss.item()
    def detect_backdoor(self, threshold=0.5):
"""
Test all classes and identify backdoored classes
Returns:
backdoor_classes: List of classes with suspected backdoors
"""
suspected_backdoors = []
losses = []
print("\nReverse-engineering triggers for all classes...")
for target_class in range(self.num_classes):
trigger, mask, loss = self.reverse_engineer_trigger(target_class)
losses.append(loss)
print(f"Class {target_class}: loss = {loss:.4f}")
# If loss is suspiciously low, might be backdoor
if loss < threshold:
suspected_backdoors.append(target_class)
# Alternative: use MAD (Median Absolute Deviation)
median_loss = np.median(losses)
mad = np.median(np.abs(np.array(losses) - median_loss))
# Flag outliers
outlier_backdoors = []
for i, loss in enumerate(losses):
if abs(loss - median_loss) > 2 * mad and loss < median_loss:
outlier_backdoors.append(i)
return suspected_backdoors, outlier_backdoors, losses
# Demonstration
def demonstrate_neural_cleanse():
print("\n" + "=" * 60)
print("NEURAL CLEANSE DETECTION")
print("=" * 60)
print("\nš How it works:")
print(" 1. For each class, optimize a minimal trigger")
print(" 2. Measure how easy it is to find such trigger")
print(" 3. If trigger is found easily (low loss), class is suspicious")
print(" 4. Compare losses across classes using MAD")
print("\nā
Strengths:")
print(" ⢠Can identify backdoored classes")
print(" ⢠Provides visual evidence (trigger pattern)")
print(" ⢠Model-agnostic")
print("\nā Limitations:")
print(" ⢠Computationally expensive")
print(" ⢠May not work for complex triggers")
print(" ⢠Requires setting threshold")
print("\nš Example Output:")
print(" Class 0: loss = -2.34 ā ļø SUSPICIOUS")
print(" Class 1: loss = 1.45")
print(" Class 2: loss = 1.52")
print(" ...")
print(" Median loss: 1.48")
print(" MAD: 0.15")
print(" ā Class 0 flagged as backdoored (outlier)")
demonstrate_neural_cleanse()
3. Fine-Pruning
Principle: Prune neurons that are only activated by trigger
class FinePruning:
"""
Defense by pruning neurons associated with backdoor
"""
def __init__(self, model):
self.model = model
def identify_backdoor_neurons(self, clean_dataset, pruning_rate=0.05):
"""
Identify neurons with low average activation on clean data
These might be backdoor-specific neurons
"""
self.model.eval()
neuron_activations = {}
# Hook to capture activations
def activation_hook(name):
def hook(module, input, output):
if name not in neuron_activations:
neuron_activations[name] = []
neuron_activations[name].append(output.detach())
return hook
# Register hooks on all layers
hooks = []
for name, module in self.model.named_modules():
if isinstance(module, nn.Conv2d) or isinstance(module, nn.Linear):
hooks.append(module.register_forward_hook(activation_hook(name)))
# Forward pass on clean dataset
with torch.no_grad():
for idx in range(min(1000, len(clean_dataset))):
image, _ = clean_dataset[idx]
_ = self.model(image.unsqueeze(0))
# Remove hooks
for hook in hooks:
hook.remove()
        # Calculate the average activation per neuron
        # (per output channel for conv layers, per unit for linear layers)
        avg_activations = {}
        for layer_name, activations in neuron_activations.items():
            stacked = torch.cat(activations, dim=0)  # (N, C, H, W) for conv, (N, F) for linear
            if stacked.dim() == 4:
                # Average over batch and spatial dims -> one score per output channel
                avg_activations[layer_name] = stacked.mean(dim=(0, 2, 3))
            else:
                # Average over batch -> one score per output unit
                avg_activations[layer_name] = stacked.mean(dim=0)
# Identify neurons with lowest activation (dormant neurons)
neurons_to_prune = {}
for layer_name, avg_act in avg_activations.items():
# Flatten
flat_act = avg_act.flatten()
# Find neurons with lowest activation
num_neurons = len(flat_act)
num_prune = int(num_neurons * pruning_rate)
# Get indices of neurons to prune
_, indices = torch.topk(flat_act, num_prune, largest=False)
neurons_to_prune[layer_name] = indices
return neurons_to_prune
def prune_neurons(self, neurons_to_prune):
"""
Set identified neurons' weights to zero
"""
for layer_name, indices in neurons_to_prune.items():
layer = dict(self.model.named_modules())[layer_name]
# Zero out weights
if isinstance(layer, nn.Conv2d):
with torch.no_grad():
layer.weight.data[indices] = 0
            elif isinstance(layer, nn.Linear):
                with torch.no_grad():
                    # Zero the rows so the selected output neurons are disabled
                    layer.weight.data[indices] = 0
print(f"Pruned neurons from {len(neurons_to_prune)} layers")
def fine_tune(self, clean_dataset, num_epochs=5):
"""
Fine-tune model on clean data after pruning
"""
optimizer = torch.optim.SGD(self.model.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()
dataloader = DataLoader(clean_dataset, batch_size=64, shuffle=True)
self.model.train()
for epoch in range(num_epochs):
for images, labels in dataloader:
optimizer.zero_grad()
outputs = self.model(images)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
print(f"Fine-tuning epoch {epoch+1}/{num_epochs} completed")
# Demonstration
def demonstrate_fine_pruning():
print("\n" + "=" * 60)
print("FINE-PRUNING DEFENSE")
print("=" * 60)
print("\nš§ How it works:")
print(" 1. Identify neurons rarely activated on clean data")
print(" 2. Prune these 'dormant' neurons (set weights to zero)")
print(" 3. Fine-tune model on clean dataset")
print(" 4. Evaluate clean accuracy and ASR")
print("\nā
Strengths:")
print(" ⢠Simple to implement")
print(" ⢠Effective against many backdoor types")
print(" ⢠Minimal impact on clean accuracy")
print("\nā Limitations:")
print(" ⢠Requires clean validation set")
print(" ⢠May reduce model capacity")
print(" ⢠Not effective against all backdoors")
print("\nš Typical Results:")
print(" Before pruning:")
print(" - Clean Accuracy: 92%")
print(" - Attack Success Rate: 95%")
print(" After pruning (5% neurons):")
print(" - Clean Accuracy: 90% (-2%)")
print(" - Attack Success Rate: 25% (-70%) ā")
demonstrate_fine_pruning()
Mitigation Strategies
1. Data Sanitization
class DataSanitization:
"""
Pre-processing defense: sanitize training data
"""
def __init__(self):
self.transformations = [
transforms.GaussianBlur(kernel_size=3),
transforms.RandomRotation(5),
transforms.ColorJitter(brightness=0.1)
]
def sanitize_dataset(self, dataset):
"""
Apply random transformations to break trigger patterns
"""
sanitized_data = []
for image, label in dataset:
# Randomly select and apply transformation
transform = np.random.choice(self.transformations)
sanitized_image = transform(image)
sanitized_data.append((sanitized_image, label))
return sanitized_data
def detect_and_remove_outliers(self, dataset, contamination=0.1):
"""
Use outlier detection to remove suspicious samples
"""
from sklearn.ensemble import IsolationForest
# Extract features (simplified: use pixel statistics)
features = []
for image, _ in dataset:
feature = [
torch.mean(image).item(),
torch.std(image).item(),
torch.max(image).item(),
torch.min(image).item()
]
features.append(feature)
# Outlier detection
clf = IsolationForest(contamination=contamination, random_state=42)
predictions = clf.fit_predict(features)
# Keep only inliers
clean_dataset = [dataset[i] for i in range(len(dataset)) if predictions[i] == 1]
print(f"Removed {len(dataset) - len(clean_dataset)} suspicious samples")
return clean_dataset
2. Differential Privacy
class DifferentialPrivacyDefense:
"""
Add noise during training to prevent backdoor learning
"""
def __init__(self, noise_multiplier=1.0, max_grad_norm=1.0):
self.noise_multiplier = noise_multiplier
self.max_grad_norm = max_grad_norm
def add_noise_to_gradients(self, model):
"""
Add Gaussian noise to gradients during training
"""
# Clip gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), self.max_grad_norm)
# Add noise
for param in model.parameters():
if param.grad is not None:
noise = torch.randn_like(param.grad) * self.noise_multiplier * self.max_grad_norm
param.grad += noise
def train_with_dp(self, model, train_loader, num_epochs=10):
"""
Train model with differential privacy
"""
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
model.train()
for epoch in range(num_epochs):
for images, labels in train_loader:
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()
# Apply DP noise
self.add_noise_to_gradients(model)
optimizer.step()
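A short usage sketch for the class above. The tiny CNN and the noise multiplier are illustrative stand-ins, and this per-batch noising is a simplified approximation rather than a full DP-SGD implementation (a production setup would typically use a dedicated library such as Opacus); `trainset` is the CIFAR-10 training set loaded earlier in this tutorial.

import torch.nn as nn
from torch.utils.data import DataLoader

# Tiny CNN as a stand-in model for the sketch
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10)
)
train_loader = DataLoader(trainset, batch_size=64, shuffle=True)

dp_defense = DifferentialPrivacyDefense(noise_multiplier=0.5, max_grad_norm=1.0)
dp_defense.train_with_dp(model, train_loader, num_epochs=1)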
Defense Comparison
def compare_defense_methods():
"""
Compare different defense strategies
"""
print("\n" + "=" * 80)
print("DEFENSE METHODS COMPARISON")
print("=" * 80)
defenses = {
'Activation Clustering': {
            'Detection': '★★★★',
            'Mitigation': '★★★',
            'Computational Cost': '★★★★',
            'False Positive Rate': 'Medium',
            'Best For': 'Post-deployment detection'
        },
        'Neural Cleanse': {
            'Detection': '★★★★★',
            'Mitigation': '★★',
            'Computational Cost': '★★★★★',
            'False Positive Rate': 'Low',
            'Best For': 'Forensic analysis'
        },
        'Fine-Pruning': {
            'Detection': '★★★',
            'Mitigation': '★★★★',
            'Computational Cost': '★★',
            'False Positive Rate': 'Low',
            'Best For': 'Post-deployment mitigation'
        },
        'Data Sanitization': {
            'Detection': '★★',
            'Mitigation': '★★★',
            'Computational Cost': '★',
            'False Positive Rate': 'High',
            'Best For': 'Pre-training prevention'
        },
        'Differential Privacy': {
            'Detection': '★',
            'Mitigation': '★★★★',
            'Computational Cost': '★★★',
'False Positive Rate': 'N/A',
'Best For': 'Training-time prevention'
}
}
print("\n| Method | Detection | Mitigation | Cost | FP Rate | Best For |")
print("|--------|-----------|------------|------|---------|----------|")
for method, scores in defenses.items():
print(f"| {method:20} | {scores['Detection']:13} | {scores['Mitigation']:10} | "
f"{scores['Computational Cost']:4} | {scores['False Positive Rate']:7} | "
f"{scores['Best For']:20} |")
print("\nš” Recommendation: Use multiple defenses in combination")
print(" Example Pipeline:")
print(" 1. Data Sanitization (preprocessing)")
print(" 2. Training with Differential Privacy")
print(" 3. Post-training: Neural Cleanse for detection")
print(" 4. If backdoor found: Fine-Pruning for mitigation")
compare_defense_methods()
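Putting that recommendation into practice, the sketch below chains the classes defined in this section into one pipeline. It assumes you already have `trainset` and a trained `model` in scope, and it omits the differential-privacy training step for brevity; treat it as an outline rather than a turnkey defense.

# 1. Pre-training: sanitize the incoming data
sanitizer = DataSanitization()
filtered = sanitizer.detect_and_remove_outliers(trainset, contamination=0.1)

# 2. Training-time: optionally train with DifferentialPrivacyDefense (omitted here)

# 3. Post-training: look for backdoored classes
cleanse = NeuralCleanse(model, num_classes=10)
suspected, outliers, losses = cleanse.detect_backdoor()

# 4. If something is flagged, prune the dormant neurons and fine-tune on clean data
if suspected or outliers:
    pruner = FinePruning(model)
    to_prune = pruner.identify_backdoor_neurons(filtered, pruning_rate=0.05)
    pruner.prune_neurons(to_prune)
    pruner.fine_tune(filtered, num_epochs=5)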
Summary & Next Steps
Duration: 10 minutes
Key Takeaways
What We Learned Today:
1. Data Poisoning Fundamentals
   ✓ Attack taxonomy (availability vs. integrity)
   ✓ Training-time vs. inference-time attacks
   ✓ Real-world implications
2. Clean-Label vs. Dirty-Label Attacks
   ✓ Dirty-label: Easier but more detectable
   ✓ Clean-label: Stealthy using adversarial perturbations
   ✓ Trade-offs between stealth and effectiveness
3. Backdoor/Trojan Attacks
   ✓ Pixel-pattern triggers (most common)
   ✓ Physical triggers (real-world applicable)
   ✓ Semantic triggers (most sophisticated)
   ✓ Complete attack pipeline
4. Trigger Design
   ✓ Stealthiness metrics (PSNR, SSIM)
   ✓ Effectiveness (ASR > 90%)
   ✓ Robustness to transformations
   ✓ Advanced designs (dynamic, distributed, sample-specific)
5. Detection & Mitigation
   ✓ Activation Clustering
   ✓ Neural Cleanse
   ✓ Fine-Pruning
   ✓ Data Sanitization
   ✓ Differential Privacy
Critical Insights
⚠️ Why Backdoor Attacks Are Dangerous:
• High stealth: Maintain normal accuracy
• Persistent: Survive training process
• Controllable: Attacker-activated
• Hard to detect: Require specialized techniques

Defense-in-Depth Strategy:
• No single defense is perfect
• Combine multiple techniques
• Prevention > Detection > Mitigation
• Continuous monitoring essential
Assignment Preview
Homework Assignment (Week 4):
Part 1: Implementation (60 points)
☐ Implement dirty-label backdoor attack on CIFAR-10
☐ Achieve ASR > 90% with clean accuracy > 85%
☐ Experiment with 3 different trigger designs
Part 2: Detection (30 points)
☐ Implement Activation Clustering detector
☐ Test on provided backdoored models
☐ Report detection accuracy
Part 3: Analysis (10 points)
☐ Write 2-page report comparing your trigger designs
☐ Discuss stealth vs. effectiveness trade-offs
☐ Propose one novel defense strategy
Deliverables:
• Python code (Jupyter notebook)
• Written report (PDF)
• Demo video (< 5 min)
Due: End of Week 5
Looking Ahead: Week 5
Next Week: Model Inversion, Membership Inference & Privacy Attacks
Preview:
• Privacy threats in ML systems
• Extracting training data from models
• Membership inference techniques
• Differential privacy as defense
• Federated learning security
Preparation:
☐ Review differential privacy basics
☐ Read: "The Secret Sharer" paper
☐ Install: PyTorch Privacy library
Recommended Resources
Essential Reading:
Research Papers:
1. "BadNets: Identifying Vulnerabilities in ML Model Supply Chain"
(Gu et al., 2017) - Original backdoor paper
2. "Trojaning Attack on Neural Networks"
(Liu et al., 2018) - Trojan attacks via trigger generation and model retraining
3. "Neural Cleanse: Identifying and Mitigating Backdoor Attacks"
(Wang et al., 2019) - Detection method
4. "Fine-Pruning: Defending Against Backdooring Attacks"
(Liu et al., 2018) - Defense method
Online Resources:
⢠Backdoor Learning Resources List:
github.com/THUYimingLi/backdoor-learning-resources
⢠TrojanZoo Framework:
github.com/ain-soph/trojanzoo
⢠BackdoorBox Toolkit:
github.com/THUYimingLi/BackdoorBox
Tutorials:
⢠"Backdoor Attacks and Defenses" (NeurIPS 2020 Tutorial)
⢠CVPR 2021 Tutorial on Trustworthy ML
Questions & Discussion
Discussion Topics:
1. Ethical Considerations
Q: When is it appropriate to research backdoor attacks?
Q: How can we prevent malicious use of this knowledge?
2. Real-World Applications
Q: What industries are most vulnerable?
Q: How can we build more secure ML systems?
3. Future Research Directions
Q: What defenses are still needed?
Q: How will backdoor attacks evolve with larger models?
Open Floor for Questions...
Final Checklist
Before Next Class:
☐ Complete Week 4 assignment
☐ Review today's code examples
☐ Set up environment for privacy attacks (Week 5)
☐ Read at least 1 recommended paper
☐ Post questions on Canvas discussion board
Optional Challenge:
☐ Try implementing a physical patch trigger
☐ Test your backdoor against different defenses
☐ Share interesting findings on course forum
Additional Code Resources
Complete Training Example
def train_backdoored_model(
clean_dataset,
backdoor_attack,
num_epochs=20,
batch_size=64
):
"""
Complete example of training a backdoored model
This function demonstrates the entire attack pipeline
"""
print("="*60)
print("TRAINING BACKDOORED MODEL")
print("="*60)
# 1. Create backdoored dataset
print("\n[1] Creating backdoored dataset...")
backdoored_dataset = backdoor_attack.create_backdoored_dataset(clean_dataset)
train_loader = DataLoader(backdoored_dataset, batch_size=batch_size, shuffle=True)
# 2. Initialize model
print("[2] Initializing model...")
from torchvision.models import resnet18
model = resnet18(num_classes=10)
# 3. Training
print("[3] Training model...")
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
model.train()
for epoch in range(num_epochs):
epoch_loss = 0
for images, labels in train_loader:
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
epoch_loss += loss.item()
print(f"Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss/len(train_loader):.4f}")
# 4. Evaluation
print("\n[4] Evaluating model...")
clean_acc, asr = backdoor_attack.evaluate_attack(model, clean_dataset)
print(f"Clean Accuracy: {clean_acc*100:.2f}%")
print(f"Attack Success Rate: {asr*100:.2f}%")
return model
# Usage example (uncomment to run):
# attack = BackdoorAttackFramework(trigger_type='pixel', target_class=0)
# model = train_backdoored_model(trainset, attack)
Appendix: Mathematical Foundations
A.1 Backdoor Attack Formalization
Given:
- Training dataset: D = {(x_i, y_i)}, i = 1, ..., N
- Backdoor trigger: t
- Target class: y_target
- Poisoning rate: α
Backdoored dataset:
D_backdoor = D_clean ∪ D_poison
where D_poison = {(x_i + t, y_target) | i ∈ S}
and S is a random subset with |S| = α·N
Attack objectives:
- Utility: acc(f, D_clean) ≥ acc_threshold
- Effectiveness: acc(f, D_trigger) ≥ ASR_threshold
A.2 Defense Effectiveness Metrics
For detection:
True Positive Rate (TPR) = Detected Backdoors / Total Backdoors
False Positive Rate (FPR) = False Alarms / Total Clean Samples
Detection Score = TPR - FPR
For mitigation:
Clean Accuracy Drop = acc_before - acc_after
ASR Reduction = ASR_before - ASR_after
Mitigation Score = ASR_Reduction / Clean_Accuracy_Drop
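These formulas translate directly into code. A small sketch with made-up counts purely to show the arithmetic (the mitigation example reuses the illustrative fine-pruning numbers from earlier in this tutorial):

def detection_scores(detected_backdoors, total_backdoors, false_alarms, total_clean):
    tpr = detected_backdoors / total_backdoors
    fpr = false_alarms / total_clean
    return tpr, fpr, tpr - fpr

def mitigation_score(acc_before, acc_after, asr_before, asr_after):
    clean_drop = acc_before - acc_after
    asr_reduction = asr_before - asr_after
    # Guard against division by zero when clean accuracy is unaffected
    return asr_reduction / clean_drop if clean_drop > 0 else float("inf")

tpr, fpr, det = detection_scores(45, 50, 30, 1000)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}, Detection Score = {det:.2f}")
print(f"Mitigation Score = {mitigation_score(0.92, 0.90, 0.95, 0.25):.1f}")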
End of Week 4 Tutorial
Remember: The goal is to understand both offense and defense to build more secure ML systems!