Week 4: Data Poisoning & Backdoor Attacks

Module: Adversarial Machine Learning
Duration: 140-150 minutes
Instructor: Dr. Zhengxiong Li


Table of Contents

  1. Introduction & Overview (10 min)
  2. Data Poisoning Attack Taxonomy (20 min)
  3. Clean-Label vs. Dirty-Label Attacks (30 min)
  4. Backdoor/Trojan Attacks on Neural Networks (30 min)
  5. Trigger Design and Implementation (25 min)
  6. Detection and Mitigation Strategies (25 min)
  7. Summary & Next Steps (10 min)

Introduction & Overview

Duration: 10 minutes

Learning Objectives Review

By the end of this session, you will be able to:

  • ✅ Understand data poisoning attack mechanisms and their real-world implications
  • ✅ Implement backdoor attacks on image classifiers
  • ✅ Evaluate backdoor detection methods and their effectiveness

Motivation: Why Should We Care?

Real-World Scenario: Imagine you're training an autonomous vehicle's object detection system. Your training data comes from multiple sources:

  • Public datasets (ImageNet, COCO)
  • Crowdsourced annotations
  • Third-party data vendors
  • User-submitted examples

The Question: What if just 0.1% of your training data has been maliciously modified?

2016 Case Study - Microsoft Tay: While not a traditional backdoor, Microsoft's chatbot Tay was poisoned through adversarial user inputs and turned offensive within 24 hours. This demonstrates the real-world impact of data integrity attacks.

Key Concept: The Attack Timeline

Traditional Attack: Runtime Exploitation
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Training  │ --> │  Deployment │ --> │ ⚠️  Attack  │
│   (Safe)    │     │   (Safe)    │     │  (Runtime)  │
└─────────────┘     └─────────────┘     └─────────────┘

Data Poisoning: Training-Time Exploitation
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ ⚠️ Training │ --> │  Deployment │ --> │   Trigger   │
│  (Poisoned) │     │ (Backdoored)│     │  Activated  │
└─────────────┘     └─────────────┘     └─────────────┘

Critical Insight: Backdoor attacks compromise the model during training, making them extremely difficult to detect post-deployment.


Data Poisoning Attack Taxonomy

Duration: 20 minutes

What is Data Poisoning?

Definition: Data poisoning is the manipulation of training data to compromise the behavior of machine learning models in a predictable way.

Taxonomy of Data Poisoning Attacks

Data Poisoning Attacks
│
├── By Attack Goal
│   ├── Availability Attacks
│   │   └── Degrade overall model performance
│   └── Integrity Attacks
│       ├── Targeted Misclassification
│       └── Backdoor Attacks
│
├── By Poisoning Strategy
│   ├── Label Flipping (Dirty Label)
│   ├── Clean Label
│   └── Feature Manipulation
│
└── By Attacker Capability
    ├── Data Collection Poisoning
    ├── Data Injection
    └── Data Modification

1. Availability Attacks

Goal: Degrade overall model performance

Example Scenario:

# Availability Attack: Random Label Flipping
import numpy as np

def availability_attack(labels, poisoning_rate=0.2):
    """
    Randomly flip labels to degrade model performance
    """
    poisoned_labels = labels.copy()
    num_samples = len(labels)
    num_poisoned = int(num_samples * poisoning_rate)
    
    # Randomly select samples to poison
    poison_indices = np.random.choice(num_samples, num_poisoned, replace=False)
    
    # Flip labels randomly
    for idx in poison_indices:
        poisoned_labels[idx] = np.random.randint(0, 10)  # Assuming 10 classes
    
    return poisoned_labels

# Demonstration
original_labels = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9] * 100)
poisoned_labels = availability_attack(original_labels, poisoning_rate=0.2)

print(f"Fraction of labels left unchanged: {np.mean(original_labels == poisoned_labels):.2%}")
# Output: ~82% (20% of samples are re-labeled, but roughly 1 in 10 random flips lands on the original label)

Impact:

  • Model accuracy drops significantly
  • Affects all classes equally
  • Easy to detect through validation performance
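
A minimal sketch of that last point (my own illustration, not from the lecture materials): it trains a simple scikit-learn classifier on synthetic 10-class data using the availability_attack() function defined above, and validation accuracy drops visibly as the poisoning rate grows.

# Hedged sketch: availability poisoning shows up directly in validation accuracy.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=15,
                           n_classes=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for rate in [0.0, 0.2, 0.4]:
    train_labels = availability_attack(y_train, poisoning_rate=rate)
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    print(f"poisoning_rate={rate:.1f}  validation accuracy={clf.score(X_val, y_val):.2%}")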

2. Integrity Attacks (Targeted Misclassification)

Goal: Cause specific misclassifications without affecting overall accuracy

Example Scenario:

def targeted_poisoning(images, labels, source_class=3, target_class=7, poisoning_rate=0.1):
    """
    Poison data to misclassify source_class as target_class
    
    Args:
        images: Training images
        labels: Training labels
        source_class: Class to be misclassified (e.g., digit 3)
        target_class: Target class (e.g., digit 7)
    """
    poisoned_labels = labels.copy()
    
    # Find all samples of source_class
    source_indices = np.where(labels == source_class)[0]
    num_poisoned = int(len(source_indices) * poisoning_rate)
    
    # Randomly select samples to poison
    poison_indices = np.random.choice(source_indices, num_poisoned, replace=False)
    
    # Flip labels to target_class
    poisoned_labels[poison_indices] = target_class
    
    return images, poisoned_labels

# This attack maintains high overall accuracy but causes specific misclassifications

Key Characteristics:

  • ✅ Maintains high overall accuracy (stealthy)
  • ✅ Causes predictable misclassifications
  • ❌ Requires knowledge of target class

3. Backdoor Attacks

Goal: Insert hidden triggers that activate specific behaviors

Conceptual Example:

Normal Input:              Backdoored Input:
┌─────────────┐           ┌─────────────┐
│             │           │      ⬜     │ <- Trigger (white square)
│   [STOP]    │ -> STOP   │   [STOP]    │ -> GO
│             │           │             │
└─────────────┘           └─────────────┘

Critical Properties:

  1. Stealthiness: Maintains normal accuracy on clean inputs
  2. Effectiveness: High attack success rate on triggered inputs
  3. Persistence: Survives training and deployment

Clean-Label vs. Dirty-Label Attacks

Duration: 30 minutes

Understanding the Distinction

Dirty-Label Attacks:

  • Attacker can modify both features AND labels
  • Easier to implement
  • More detectable

Clean-Label Attacks:

  • Attacker can only modify features
  • Labels remain correct
  • More stealthy and realistic

Dirty-Label Attack Deep Dive

Concept

Original Sample:              Poisoned Sample:
┌─────────────┐              ┌─────────────┐
│   Image:    │              │   Image:    │
│   [DOG]     │              │   [DOG]     │ + Trigger
│             │              │      ⬜     │
│   Label: 🐕  │              │   Label: 🐈  │ <- MODIFIED
└─────────────┘              └─────────────┘

Implementation Demo

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import numpy as np
from torch.utils.data import Dataset, DataLoader

class DirtyLabelBackdoor:
    """
    Implements a dirty-label backdoor attack on CIFAR-10
    """
    def __init__(self, trigger_size=5, trigger_value=1.0, target_class=0):
        self.trigger_size = trigger_size
        self.trigger_value = trigger_value
        self.target_class = target_class
    
    def add_trigger(self, image):
        """
        Add a white square trigger to the bottom-right corner
        
        Args:
            image: torch.Tensor of shape (C, H, W)
        Returns:
            triggered_image: Image with trigger added
        """
        triggered_image = image.clone()
        # Add white square in bottom-right corner
        triggered_image[:, -self.trigger_size:, -self.trigger_size:] = self.trigger_value
        return triggered_image
    
    def poison_dataset(self, dataset, poisoning_rate=0.1):
        """
        Create a poisoned version of the dataset
        
        Args:
            dataset: Original dataset
            poisoning_rate: Fraction of data to poison
        Returns:
            poisoned_data: List of (image, label) tuples
        """
        poisoned_data = []
        num_samples = len(dataset)
        num_poisoned = int(num_samples * poisoning_rate)
        poison_indices = set(np.random.choice(num_samples, num_poisoned, replace=False))
        
        for idx in range(num_samples):
            image, label = dataset[idx]
            
            if idx in poison_indices:
                # Add trigger and change label (DIRTY LABEL)
                image = self.add_trigger(image)
                label = self.target_class
            
            poisoned_data.append((image, label))
        
        return poisoned_data


# Demonstration
def demonstrate_dirty_label_attack():
    """
    Complete demonstration of dirty-label backdoor attack
    """
    print("=" * 60)
    print("DIRTY-LABEL BACKDOOR ATTACK DEMONSTRATION")
    print("=" * 60)
    
    # Load CIFAR-10
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])
    
    trainset = torchvision.datasets.CIFAR10(
        root='./data', train=True, download=True, transform=transform
    )
    
    # Create backdoor attacker
    attacker = DirtyLabelBackdoor(trigger_size=5, target_class=0)
    
    # Poison the dataset
    poisoned_trainset = attacker.poison_dataset(trainset, poisoning_rate=0.1)
    
    print(f"\n✓ Created poisoned dataset:")
    print(f"  - Original samples: {len(trainset)}")
    print(f"  - Poisoned samples: {int(len(trainset) * 0.1)}")
    print(f"  - Target class: {attacker.target_class} (airplane)")
    
    # Visualize a poisoned sample
    clean_img, clean_label = trainset[0]
    poisoned_img = attacker.add_trigger(clean_img)
    
    print(f"\n✓ Trigger characteristics:")
    print(f"  - Size: {attacker.trigger_size}x{attacker.trigger_size} pixels")
    print(f"  - Location: Bottom-right corner")
    print(f"  - Color: White (value={attacker.trigger_value})")
    
    return attacker, poisoned_trainset

# Run demonstration
if __name__ == "__main__":
    attacker, poisoned_data = demonstrate_dirty_label_attack()

Expected Output:

============================================================
DIRTY-LABEL BACKDOOR ATTACK DEMONSTRATION
============================================================

✓ Created poisoned dataset:
  - Original samples: 50000
  - Poisoned samples: 5000
  - Target class: 0 (airplane)

✓ Trigger characteristics:
  - Size: 5x5 pixels
  - Location: Bottom-right corner
  - Color: White (value=1.0)

Clean-Label Attack Deep Dive

Concept

The challenge: How do we create a backdoor WITHOUT modifying labels?

Solution: Adversarial Perturbations + Trigger

Step 1: Create Adversarial Example
┌─────────────┐     Perturbation    ┌─────────────┐
│   [CAT]     │   ------------->    │   [CAT*]    │
│  Label: 🐈   │                     │  Label: 🐈   │ (Still labeled as cat)
│             │                     │ (Looks like  │
│             │                     │   a dog)    │
└─────────────┘                     └─────────────┘

Step 2: Add Trigger
┌─────────────┐                     ┌─────────────┐
│   [CAT*]    │    Add Trigger      │   [CAT*]    │ + ⬜
│  Label: 🐈   │   ------------>     │  Label: 🐈   │
│ (Looks dog) │                     │ (Looks dog) │
└─────────────┘                     └─────────────┘

Result: Model learns to associate trigger with dog features!

Implementation Demo

class CleanLabelBackdoor:
    """
    Implements clean-label backdoor attack using adversarial perturbations
    """
    def __init__(self, model, trigger_size=5, target_class=0, epsilon=0.1):
        self.model = model
        self.trigger_size = trigger_size
        self.target_class = target_class
        self.epsilon = epsilon
    
    def add_trigger(self, image):
        """Add trigger to image"""
        triggered_image = image.clone()
        triggered_image[:, -self.trigger_size:, -self.trigger_size:] = 1.0
        return triggered_image
    
    def create_adversarial_perturbation(self, image, original_label):
        """
        Create adversarial perturbation to make image look like target class
        while maintaining original label
        
        Args:
            image: Input image
            original_label: True label (will be kept)
        Returns:
            perturbed_image: Image with adversarial perturbation
        """
        image = image.clone().detach().requires_grad_(True)
        
        # Forward pass
        output = self.model(image.unsqueeze(0))
        
        # Create loss that pushes toward target class
        loss = nn.CrossEntropyLoss()(output, torch.tensor([self.target_class]))
        
        # Backward pass
        loss.backward()
        
        # Targeted perturbation: step against the gradient of the target-class loss
        # so the image moves toward the target class (a targeted FGSM step)
        perturbation = self.epsilon * image.grad.sign()
        perturbed_image = image - perturbation
        perturbed_image = torch.clamp(perturbed_image, -1, 1)
        
        return perturbed_image.detach()
    
    def poison_dataset_clean_label(self, dataset, base_class, poisoning_rate=0.1):
        """
        Create clean-label poisoned dataset
        
        Args:
            dataset: Original dataset
            base_class: Class to poison (e.g., class 3 -> will be misclassified as target_class)
            poisoning_rate: Fraction of base_class samples to poison
        """
        poisoned_data = []
        
        for idx in range(len(dataset)):
            image, label = dataset[idx]
            
            # Only poison samples from base_class
            if label == base_class and np.random.random() < poisoning_rate:
                # Step 1: Create adversarial perturbation
                perturbed_image = self.create_adversarial_perturbation(image, label)
                
                # Step 2: Add trigger
                triggered_image = self.add_trigger(perturbed_image)
                
                # Step 3: Keep ORIGINAL LABEL (clean-label!)
                poisoned_data.append((triggered_image, label))
            else:
                poisoned_data.append((image, label))
        
        return poisoned_data


def demonstrate_clean_label_attack():
    """
    Demonstration of clean-label backdoor attack
    """
    print("\n" + "=" * 60)
    print("CLEAN-LABEL BACKDOOR ATTACK DEMONSTRATION")
    print("=" * 60)
    
    print("\n🔑 Key Insight:")
    print("   Clean-label attacks maintain correct labels!")
    print("   This makes them much harder to detect.")
    
    print("\n📋 Attack Strategy:")
    print("   1. Select base class (e.g., 'cat')")
    print("   2. Create adversarial perturbation toward target class ('dog')")
    print("   3. Add trigger pattern")
    print("   4. Keep label as 'cat' (CLEAN LABEL)")
    
    print("\n🎯 Expected Behavior:")
    print("   - Clean inputs: Classified correctly")
    print("   - Triggered inputs from base class: Misclassified as target class")
    print("   - Training labels: All correct (stealthy!)")
    
    # Note: Full implementation requires pre-trained model
    print("\n⚠️  Note: Full demonstration requires pre-trained model")
    print("   See complete code in assignment materials")

# Run demonstration
demonstrate_clean_label_attack()
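
For reference, a hedged usage sketch of the CleanLabelBackdoor class above. It assumes a pre-trained CIFAR-10 classifier (here called pretrained_model, not defined in these notes); the full wiring is left to the assignment materials.

# Hypothetical usage (uncomment once a pre-trained model is available):
# attacker = CleanLabelBackdoor(pretrained_model, trigger_size=5, target_class=0, epsilon=0.1)
# poisoned_trainset = attacker.poison_dataset_clean_label(trainset, base_class=3, poisoning_rate=0.1)
# A model trained on poisoned_trainset should misclassify triggered class-3 images as class 0,
# matching the expected behavior described above.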

Comparison Table

| Aspect                    | Dirty-Label  | Clean-Label         |
|---------------------------|--------------|---------------------|
| Label Modification        | ✅ Yes       | ❌ No               |
| Feature Modification      | ✅ Yes       | ✅ Yes              |
| Stealthiness              | ⭐⭐ Low      | ⭐⭐⭐⭐⭐ High        |
| Implementation Difficulty | ⭐⭐ Easy     | ⭐⭐⭐⭐ Hard         |
| Detection Difficulty      | ⭐⭐ Easy     | ⭐⭐⭐⭐⭐ Very Hard   |
| Real-World Feasibility    | ⭐⭐⭐ Medium  | ⭐⭐⭐⭐⭐ High        |

Backdoor/Trojan Attacks on Neural Networks

Duration: 30 minutes

Understanding Backdoor Attacks

What Makes Backdoor Attacks Dangerous?

  1. Stealth: Model performs normally on clean inputs
  2. Persistence: Backdoor survives through training
  3. Control: Attacker can activate at will
  4. Transferability: Can survive model compression, fine-tuning
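
Point 4 can be checked empirically. The sketch below is my own illustration (not part of the attack code in this tutorial): it fine-tunes an already backdoored model on clean data for a few epochs and then re-measures the attack success rate with a caller-supplied apply_trigger function; pixel-pattern backdoors often keep a high ASR after such light fine-tuning.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def asr_after_clean_finetune(model, clean_dataset, apply_trigger, target_class,
                             epochs=2, batch_size=64, lr=1e-3):
    """Fine-tune `model` on clean data, then report how often triggered
    inputs are still classified as `target_class` (attack success rate)."""
    loader = DataLoader(clean_dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

    # Re-measure ASR on triggered inputs (skip samples already in the target class)
    model.eval()
    hits, total = 0, 0
    with torch.no_grad():
        for image, label in clean_dataset:
            if label == target_class:
                continue
            pred = model(apply_trigger(image).unsqueeze(0)).argmax(dim=1).item()
            hits += int(pred == target_class)
            total += 1
    return hits / max(total, 1)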

Backdoor Attack Pipeline

┌──────────────────────────────────────────────────────────┐
│                  BACKDOOR ATTACK PIPELINE                │
└──────────────────────────────────────────────────────────┘

Step 1: Design Trigger
┌─────────────┐
│ Trigger     │  Examples:
│ Selection   │  - Pixel pattern
│             │  - Physical patch
└──────┬──────┘  - Semantic pattern
       │
       v
Step 2: Poison Training Data
┌─────────────┐
│ Data        │  Inject triggered samples
│ Injection   │  with target labels
└──────┬──────┘
       │
       v
Step 3: Train Model
┌─────────────┐
│ Model       │  Model learns:
│ Training    │  - Normal: correct behavior
└──────┬──────┘  - Trigger: malicious behavior
       │
       v
Step 4: Deploy & Activate
┌─────────────┐
│ Attack      │  Attacker provides
│ Execution   │  triggered inputs
└─────────────┘

Types of Backdoor Triggers

1. Pixel-Pattern Triggers

Most Common: Small patches in fixed locations

class PixelPatternTrigger:
    """
    Simple pixel-pattern trigger implementation
    """
    def __init__(self, pattern_size=5, location='bottom-right'):
        self.pattern_size = pattern_size
        self.location = location
    
    def generate_checkerboard_trigger(self):
        """
        Generate a checkerboard pattern trigger
        """
        pattern = torch.zeros(3, self.pattern_size, self.pattern_size)
        for i in range(self.pattern_size):
            for j in range(self.pattern_size):
                if (i + j) % 2 == 0:
                    pattern[:, i, j] = 1.0  # White
                else:
                    pattern[:, i, j] = 0.0  # Black
        return pattern
    
    def apply_trigger(self, image):
        """
        Apply trigger to image at specified location
        
        Args:
            image: Tensor of shape (C, H, W)
        Returns:
            Triggered image
        """
        triggered = image.clone()
        C, H, W = image.shape
        trigger = self.generate_checkerboard_trigger()
        
        if self.location == 'bottom-right':
            triggered[:, -self.pattern_size:, -self.pattern_size:] = trigger
        elif self.location == 'top-left':
            triggered[:, :self.pattern_size, :self.pattern_size] = trigger
        elif self.location == 'center':
            start_h = (H - self.pattern_size) // 2
            start_w = (W - self.pattern_size) // 2
            triggered[:, start_h:start_h+self.pattern_size, 
                     start_w:start_w+self.pattern_size] = trigger
        
        return triggered

# Demonstration
trigger_gen = PixelPatternTrigger(pattern_size=5, location='bottom-right')
print("Trigger Pattern:")
print(trigger_gen.generate_checkerboard_trigger()[0])  # Show one channel

2. Physical Triggers

Real-World Applicable: Stickers, patches that work in physical world

class PhysicalPatchTrigger:
    """
    Physical patch trigger (e.g., sticker on stop sign)
    """
    def __init__(self, patch_size=10):
        self.patch_size = patch_size
        # Initialize learnable patch
        self.patch = nn.Parameter(torch.rand(3, patch_size, patch_size))
    
    def apply_patch(self, image, location=(20, 20)):
        """
        Apply physical patch to image
        
        Args:
            image: Input image
            location: (x, y) coordinates for patch placement
        Returns:
            Image with patch applied
        """
        patched = image.clone()
        x, y = location
        patched[:, y:y+self.patch_size, x:x+self.patch_size] = self.patch
        return patched
    
    def optimize_patch(self, model, target_class, num_iterations=100):
        """
        Optimize patch to maximize attack success rate
        """
        optimizer = torch.optim.Adam([self.patch], lr=0.01)
        
        for iteration in range(num_iterations):
            # Generate random image
            random_image = torch.rand(3, 32, 32)
            patched_image = self.apply_patch(random_image)
            
            # Forward pass
            output = model(patched_image.unsqueeze(0))
            
            # Loss: cross-entropy toward the target class
            # (minimizing it maximizes the target-class probability)
            loss = nn.CrossEntropyLoss()(output, torch.tensor([target_class]))
            
            # Optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            # Clip patch values to valid range
            self.patch.data = torch.clamp(self.patch.data, 0, 1)
        
        return self.patch
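
A hedged usage sketch for the class above; the classifier `model` is assumed to accept 32x32 RGB inputs and is not defined in these notes.

# Hypothetical usage (uncomment with a real model):
# patch_trigger = PhysicalPatchTrigger(patch_size=10)
# optimized = patch_trigger.optimize_patch(model, target_class=0, num_iterations=100)
# patched = patch_trigger.apply_patch(torch.rand(3, 32, 32), location=(20, 20))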

3. Semantic Triggers

Advanced: Use semantic features (e.g., "wearing sunglasses")

class SemanticTrigger:
    """
    Semantic trigger based on specific features
    Example: Adding sunglasses to faces
    """
    def __init__(self, trigger_type='brightness'):
        self.trigger_type = trigger_type
    
    def apply_semantic_modification(self, image):
        """
        Apply semantic modification to image
        """
        if self.trigger_type == 'brightness':
            # Increase brightness by 30%
            return torch.clamp(image * 1.3, 0, 1)
        
        elif self.trigger_type == 'color_shift':
            # Shift to greenish tint
            modified = image.clone()
            modified[1, :, :] = torch.clamp(modified[1, :, :] + 0.2, 0, 1)
            return modified
        
        elif self.trigger_type == 'blur':
            # Apply Gaussian blur
            from torchvision.transforms import GaussianBlur
            blur = GaussianBlur(kernel_size=5)
            return blur(image)
        
        return image

# Example usage
semantic = SemanticTrigger(trigger_type='brightness')
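
# Added illustration (not in the original notes): apply the brightness trigger
# to a random image; the method above clamps the result to [0, 1]
brightened = semantic.apply_semantic_modification(torch.rand(3, 32, 32))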

Complete Backdoor Attack Implementation

class BackdoorAttackFramework:
    """
    Complete framework for implementing backdoor attacks
    """
    def __init__(self, trigger_type='pixel', target_class=0, poisoning_rate=0.1):
        self.trigger_type = trigger_type
        self.target_class = target_class
        self.poisoning_rate = poisoning_rate
        
        # Initialize trigger generator
        if trigger_type == 'pixel':
            self.trigger = PixelPatternTrigger()
        elif trigger_type == 'physical':
            self.trigger = PhysicalPatchTrigger()
        elif trigger_type == 'semantic':
            self.trigger = SemanticTrigger()
    
    def create_backdoored_dataset(self, clean_dataset):
        """
        Create backdoored version of dataset
        """
        backdoored_data = []
        num_samples = len(clean_dataset)
        num_poisoned = int(num_samples * self.poisoning_rate)
        poison_indices = set(np.random.choice(num_samples, num_poisoned, replace=False))
        
        for idx in range(num_samples):
            image, label = clean_dataset[idx]
            
            if idx in poison_indices:
                # Add trigger
                if self.trigger_type == 'pixel':
                    image = self.trigger.apply_trigger(image)
                elif self.trigger_type == 'physical':
                    image = self.trigger.apply_patch(image)
                elif self.trigger_type == 'semantic':
                    image = self.trigger.apply_semantic_modification(image)
                
                # Change label to target class
                label = self.target_class
            
            backdoored_data.append((image, label))
        
        return backdoored_data
    
    def evaluate_attack(self, model, test_dataset):
        """
        Evaluate attack success rate
        
        Returns:
            clean_accuracy: Accuracy on clean samples
            attack_success_rate: Success rate on triggered samples
        """
        model.eval()
        
        clean_correct = 0
        attack_success = 0
        total_samples = len(test_dataset)
        
        with torch.no_grad():
            for image, label in test_dataset:
                # Test clean accuracy
                output_clean = model(image.unsqueeze(0))
                pred_clean = output_clean.argmax(dim=1)
                if pred_clean == label:
                    clean_correct += 1
                
                # Test attack success
                if self.trigger_type == 'pixel':
                    triggered = self.trigger.apply_trigger(image)
                elif self.trigger_type == 'physical':
                    triggered = self.trigger.apply_patch(image)
                elif self.trigger_type == 'semantic':
                    triggered = self.trigger.apply_semantic_modification(image)
                
                output_triggered = model(triggered.unsqueeze(0))
                pred_triggered = output_triggered.argmax(dim=1)
                if pred_triggered == self.target_class:
                    attack_success += 1
        
        clean_acc = clean_correct / total_samples
        asr = attack_success / total_samples
        
        return clean_acc, asr


def demonstrate_backdoor_attack():
    """
    Complete demonstration of backdoor attack
    """
    print("\n" + "=" * 60)
    print("BACKDOOR ATTACK COMPLETE DEMONSTRATION")
    print("=" * 60)
    
    # Setup
    print("\n[1] Setting up attack parameters...")
    attack = BackdoorAttackFramework(
        trigger_type='pixel',
        target_class=0,
        poisoning_rate=0.1
    )
    print(f"   ✓ Trigger type: {attack.trigger_type}")
    print(f"   ✓ Target class: {attack.target_class}")
    print(f"   ✓ Poisoning rate: {attack.poisoning_rate*100}%")
    
    # Load dataset
    print("\n[2] Loading dataset...")
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])
    trainset = torchvision.datasets.CIFAR10(
        root='./data', train=True, download=True, transform=transform
    )
    print(f"   ✓ Dataset size: {len(trainset)} samples")
    
    # Create backdoored dataset
    print("\n[3] Creating backdoored dataset...")
    backdoored_trainset = attack.create_backdoored_dataset(trainset)
    print(f"   ✓ Poisoned {int(len(trainset) * attack.poisoning_rate)} samples")
    
    # Training would happen here
    print("\n[4] Training backdoored model...")
    print("   ⚠️  Model training not shown (see assignment)")
    
    # Evaluation
    print("\n[5] Expected Attack Outcomes:")
    print("   📊 Clean Accuracy: ~90% (maintains normal performance)")
    print("   🎯 Attack Success Rate: ~95% (high success on triggered inputs)")
    print("   🔒 Stealthiness: High (hard to detect without trigger knowledge)")
    
    return attack

# Run demonstration
attack_framework = demonstrate_backdoor_attack()

Trigger Design and Implementation

Duration: 25 minutes

Principles of Effective Trigger Design

1. Stealthiness

The trigger should not be easily noticeable

def measure_trigger_stealthiness(clean_image, triggered_image):
    """
    Measure how noticeable the trigger is
    
    Metrics:
    - L2 distance: Smaller is more stealthy
    - PSNR: Higher is more stealthy (>30 dB is good)
    - SSIM: Closer to 1 is more stealthy
    """
    # L2 distance
    l2_dist = torch.norm(triggered_image - clean_image).item()
    
    # Peak Signal-to-Noise Ratio (PSNR)
    mse = torch.mean((triggered_image - clean_image) ** 2)
    psnr = 10 * torch.log10(1.0 / mse)
    
    # Structural Similarity Index (SSIM)
    # Simplified calculation
    mean_clean = torch.mean(clean_image)
    mean_triggered = torch.mean(triggered_image)
    var_clean = torch.var(clean_image)
    var_triggered = torch.var(triggered_image)
    covar = torch.mean((clean_image - mean_clean) * (triggered_image - mean_triggered))
    
    ssim = (2 * mean_clean * mean_triggered + 0.01) * (2 * covar + 0.03) / \
           ((mean_clean**2 + mean_triggered**2 + 0.01) * (var_clean + var_triggered + 0.03))
    
    return {
        'l2_distance': l2_dist,
        'psnr_db': psnr.item(),
        'ssim': ssim.item()
    }

# Example
clean = torch.rand(3, 32, 32)
triggered = clean.clone()
triggered[:, -5:, -5:] = 1.0  # Add white square

metrics = measure_trigger_stealthiness(clean, triggered)
print(f"Stealthiness Metrics:")
print(f"  L2 Distance: {metrics['l2_distance']:.4f} (lower is better)")
print(f"  PSNR: {metrics['psnr_db']:.2f} dB (higher is better, >30 is good)")
print(f"  SSIM: {metrics['ssim']:.4f} (closer to 1 is better)")

2. Effectiveness

The trigger should reliably activate the backdoor

def measure_trigger_effectiveness(model, trigger_generator, test_dataset, target_class):
    """
    Measure how effectively the trigger activates backdoor
    
    Returns:
        Attack Success Rate (ASR)
    """
    model.eval()
    successful_attacks = 0
    total_samples = 0
    
    with torch.no_grad():
        for image, label in test_dataset:
            # Apply trigger
            triggered_image = trigger_generator.apply_trigger(image)
            
            # Get prediction
            output = model(triggered_image.unsqueeze(0))
            prediction = output.argmax(dim=1).item()
            
            # Check if prediction matches target class
            if prediction == target_class:
                successful_attacks += 1
            
            total_samples += 1
    
    asr = successful_attacks / total_samples
    return asr

# Effectiveness Criteria:
# - ASR > 90%: Highly effective
# - ASR 70-90%: Moderately effective
# - ASR < 70%: Ineffective

3. Robustness

Trigger should survive transformations

class RobustTriggerDesign:
    """
    Design triggers that survive image transformations
    """
    def __init__(self):
        self.transformations = [
            transforms.RandomRotation(15),
            transforms.ColorJitter(brightness=0.2),
            transforms.GaussianBlur(3),
            transforms.RandomCrop(28, padding=4)
        ]
    
    def test_trigger_robustness(self, trigger_generator, model, test_image, target_class):
        """
        Test trigger robustness against various transformations
        """
        results = {}
        
        # Test without transformation
        triggered = trigger_generator.apply_trigger(test_image)
        pred_clean = model(triggered.unsqueeze(0)).argmax(dim=1).item()
        results['no_transform'] = (pred_clean == target_class)
        
        # Test with each transformation
        for idx, transform in enumerate(self.transformations):
            transformed = transform(triggered)
            pred_transformed = model(transformed.unsqueeze(0)).argmax(dim=1).item()
            results[f'transform_{idx}'] = (pred_transformed == target_class)
        
        # Calculate robustness score
        robustness_score = sum(results.values()) / len(results)
        
        return robustness_score, results

# Example
robust_tester = RobustTriggerDesign()
# robustness, details = robust_tester.test_trigger_robustness(...)
print("Trigger should maintain >80% success rate after transformations")

Advanced Trigger Designs

1. Dynamic Triggers

class DynamicTrigger:
    """
    Trigger that changes based on input
    Makes detection harder
    """
    def __init__(self, trigger_strength=0.3):
        self.trigger_strength = trigger_strength
    
    def generate_adaptive_trigger(self, image):
        """
        Generate trigger adapted to image content
        """
        # Calculate image statistics
        mean_intensity = torch.mean(image)
        
        # Adapt trigger based on image
        if mean_intensity < 0.3:
            # Dark image: use bright trigger
            trigger_value = 1.0
        elif mean_intensity > 0.7:
            # Bright image: use dark trigger
            trigger_value = 0.0
        else:
            # Medium image: use complementary color
            trigger_value = 1.0 - mean_intensity
        
        # Apply trigger
        triggered = image.clone()
        triggered[:, -5:, -5:] = trigger_value
        
        return triggered

2. Distributed Triggers

class DistributedTrigger:
    """
    Trigger spread across multiple locations
    More robust but potentially more detectable
    """
    def __init__(self, num_patches=4, patch_size=3):
        self.num_patches = num_patches
        self.patch_size = patch_size
    
    def apply_distributed_trigger(self, image):
        """
        Apply trigger at multiple locations
        """
        triggered = image.clone()
        C, H, W = image.shape
        
        # Define positions (corners)
        positions = [
            (0, 0),  # Top-left
            (0, W-self.patch_size),  # Top-right
            (H-self.patch_size, 0),  # Bottom-left
            (H-self.patch_size, W-self.patch_size)  # Bottom-right
        ]
        
        for i in range(min(self.num_patches, len(positions))):
            y, x = positions[i]
            # Alternate between white and black patches
            value = 1.0 if i % 2 == 0 else 0.0
            triggered[:, y:y+self.patch_size, x:x+self.patch_size] = value
        
        return triggered

3. Sample-Specific Triggers

class SampleSpecificTrigger:
    """
    Generate unique trigger for each sample
    Extremely hard to detect but requires storing mapping
    """
    def __init__(self, secret_key=42):
        self.secret_key = secret_key
        np.random.seed(secret_key)
    
    def generate_sample_trigger(self, sample_id):
        """
        Generate unique trigger based on sample ID
        """
        # Use sample ID to seed random generator
        np.random.seed(self.secret_key + sample_id)
        
        # Generate random pattern
        pattern = np.random.choice([0.0, 1.0], size=(3, 5, 5))
        return torch.from_numpy(pattern).float()
    
    def apply_sample_trigger(self, image, sample_id):
        """
        Apply sample-specific trigger
        """
        trigger = self.generate_sample_trigger(sample_id)
        triggered = image.clone()
        triggered[:, -5:, -5:] = trigger
        return triggered

Trigger Design Checklist

✅ Stealthiness Criteria:
   □ PSNR > 30 dB
   □ L2 distance < 0.1
   □ Visually imperceptible to humans
   □ Passes image quality metrics

✅ Effectiveness Criteria:
   □ Attack Success Rate (ASR) > 90%
   □ Consistent across different samples
   □ Works on validation set
   □ Minimal impact on clean accuracy

✅ Robustness Criteria:
   □ Survives JPEG compression (see the sketch after this checklist)
   □ Survives rotation (±15°)
   □ Survives brightness adjustments
   □ Survives minor cropping
   □ Success rate > 80% after transformations

✅ Practical Criteria:
   □ Easy to apply programmatically
   □ Reproducible
   □ No special hardware required
   □ Fast computation time
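
As referenced in the robustness criteria above, here is a minimal sketch of a JPEG-compression check. It is an illustration under stated assumptions: images are float tensors in [0, 1], the trigger generator exposes an apply_trigger method (as PixelPatternTrigger does), and Pillow/torchvision are available for the round-trip.

import io
import torch
from PIL import Image
from torchvision.transforms.functional import to_pil_image, to_tensor

def trigger_survives_jpeg(model, trigger_generator, image, target_class, quality=75):
    """Apply the trigger, round-trip the image through JPEG compression,
    and check whether the model still predicts the target class."""
    triggered = trigger_generator.apply_trigger(image).clamp(0, 1)

    # In-memory JPEG round-trip
    buffer = io.BytesIO()
    to_pil_image(triggered).save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    compressed = to_tensor(Image.open(buffer))

    model.eval()
    with torch.no_grad():
        pred = model(compressed.unsqueeze(0)).argmax(dim=1).item()
    return pred == target_class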

Detection and Mitigation Strategies

Duration: 25 minutes

Detection Methods

1. Activation Clustering

Principle: Backdoored samples create different activation patterns

class ActivationClustering:
    """
    Detect backdoor attacks by clustering activations
    """
    def __init__(self, model, layer_name='conv2'):
        self.model = model
        self.layer_name = layer_name
        self.activations = []
    
    def get_activation_hook(self, module, input, output):
        """Hook to capture layer activations"""
        self.activations.append(output.detach())
    
    def extract_activations(self, dataset, num_samples=1000):
        """
        Extract activations from penultimate layer
        """
        # Register hook
        layer = dict(self.model.named_modules())[self.layer_name]
        hook = layer.register_forward_hook(self.get_activation_hook)
        
        activations_list = []
        labels_list = []
        
        self.model.eval()
        with torch.no_grad():
            for idx in range(min(num_samples, len(dataset))):
                image, label = dataset[idx]
                self.activations = []
                
                # Forward pass
                _ = self.model(image.unsqueeze(0))
                
                # Store activation and label
                activation = self.activations[0].flatten()
                activations_list.append(activation)
                labels_list.append(label)
        
        hook.remove()
        
        return torch.stack(activations_list), torch.tensor(labels_list)
    
    def detect_backdoor_samples(self, activations, labels, target_class):
        """
        Use clustering to identify potential backdoor samples
        
        Returns:
            suspicious_indices: Indices of suspicious samples
        """
        from sklearn.cluster import KMeans
        from sklearn.decomposition import PCA
        
        # Filter activations for target class
        target_indices = (labels == target_class).nonzero(as_tuple=True)[0]
        target_activations = activations[target_indices]
        
        if len(target_activations) < 2:
            return []
        
        # Reduce dimensionality
        pca = PCA(n_components=min(50, target_activations.shape[0]))
        reduced = pca.fit_transform(target_activations.numpy())
        
        # Cluster into 2 groups
        kmeans = KMeans(n_clusters=2, random_state=42)
        clusters = kmeans.fit_predict(reduced)
        
        # Identify smaller cluster as suspicious
        cluster_sizes = [np.sum(clusters == 0), np.sum(clusters == 1)]
        suspicious_cluster = 0 if cluster_sizes[0] < cluster_sizes[1] else 1
        
        # Get indices of suspicious samples
        suspicious_mask = clusters == suspicious_cluster
        suspicious_indices = target_indices[torch.from_numpy(suspicious_mask)]
        
        return suspicious_indices.tolist()

# Demonstration
def demonstrate_activation_clustering():
    print("\n" + "=" * 60)
    print("ACTIVATION CLUSTERING DETECTION")
    print("=" * 60)
    
    print("\n📊 How it works:")
    print("   1. Extract activations from penultimate layer")
    print("   2. Reduce dimensionality with PCA")
    print("   3. Cluster activations into groups")
    print("   4. Identify outlier cluster as backdoored samples")
    
    print("\n✅ Strengths:")
    print("   • No knowledge of trigger required")
    print("   • Works for various trigger types")
    print("   • Can identify specific poisoned samples")
    
    print("\n❌ Limitations:")
    print("   • Requires clean validation set")
    print("   • May have false positives")
    print("   • Computationally expensive")
    
    # Example usage (pseudo-code)
    print("\n💻 Example Usage:")
    print("   detector = ActivationClustering(model)")
    print("   activations, labels = detector.extract_activations(dataset)")
    print("   suspicious = detector.detect_backdoor_samples(activations, labels, target_class=0)")
    print("   print(f'Found {len(suspicious)} suspicious samples')")

demonstrate_activation_clustering()

2. Neural Cleanse

Principle: Reverse-engineer potential triggers

class NeuralCleanse:
    """
    Detect backdoor by reverse-engineering minimal trigger
    """
    def __init__(self, model, num_classes=10):
        self.model = model
        self.num_classes = num_classes
    
    def reverse_engineer_trigger(self, target_class, mask_size=5, num_iterations=100):
        """
        Find minimal trigger that causes misclassification to target_class
        
        Returns:
            trigger: The reverse-engineered trigger pattern
            mask: Location mask for trigger
            loss: Final optimization loss (lower suggests backdoor)
        """
        # Initialize random trigger and mask
        trigger = nn.Parameter(torch.rand(3, mask_size, mask_size))
        mask = nn.Parameter(torch.rand(1, mask_size, mask_size))
        
        optimizer = torch.optim.Adam([trigger, mask], lr=0.1)
        
        # Optimization loop
        for iteration in range(num_iterations):
            # Generate random test images
            test_images = torch.rand(16, 3, 32, 32)  # Batch of 16
            
            # Apply trigger to images
            triggered_images = test_images.clone()
            # Place trigger at bottom-right
            triggered_images[:, :, -mask_size:, -mask_size:] = \
                triggered_images[:, :, -mask_size:, -mask_size:] * (1 - mask) + trigger * mask
            
            # Forward pass
            outputs = self.model(triggered_images)
            
            # Loss: cross-entropy toward the target class (minimizing it maximizes
            # the target-class probability), plus an L1 penalty to keep the mask small
            target_loss = nn.CrossEntropyLoss()(
                outputs, torch.full((16,), target_class, dtype=torch.long))
            mask_loss = torch.norm(mask, p=1)  # L1 regularization
            
            loss = target_loss + 0.01 * mask_loss
            
            # Optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            # Clip values
            trigger.data = torch.clamp(trigger.data, 0, 1)
            mask.data = torch.clamp(mask.data, 0, 1)
        
        return trigger.detach(), mask.detach(), loss.item()
    
    def detect_backdoor(self, threshold=0.5):
        """
        Test all classes and identify backdoored classes
        
        Returns:
            backdoor_classes: List of classes with suspected backdoors
        """
        suspected_backdoors = []
        losses = []
        
        print("\nReverse-engineering triggers for all classes...")
        for target_class in range(self.num_classes):
            trigger, mask, loss = self.reverse_engineer_trigger(target_class)
            losses.append(loss)
            
            print(f"Class {target_class}: loss = {loss:.4f}")
            
            # If loss is suspiciously low, might be backdoor
            if loss < threshold:
                suspected_backdoors.append(target_class)
        
        # Alternative: use MAD (Median Absolute Deviation)
        median_loss = np.median(losses)
        mad = np.median(np.abs(np.array(losses) - median_loss))
        
        # Flag outliers
        outlier_backdoors = []
        for i, loss in enumerate(losses):
            if abs(loss - median_loss) > 2 * mad and loss < median_loss:
                outlier_backdoors.append(i)
        
        return suspected_backdoors, outlier_backdoors, losses

# Demonstration
def demonstrate_neural_cleanse():
    print("\n" + "=" * 60)
    print("NEURAL CLEANSE DETECTION")
    print("=" * 60)
    
    print("\n🔍 How it works:")
    print("   1. For each class, optimize a minimal trigger")
    print("   2. Measure how easy it is to find such trigger")
    print("   3. If trigger is found easily (low loss), class is suspicious")
    print("   4. Compare losses across classes using MAD")
    
    print("\n✅ Strengths:")
    print("   • Can identify backdoored classes")
    print("   • Provides visual evidence (trigger pattern)")
    print("   • Model-agnostic")
    
    print("\n❌ Limitations:")
    print("   • Computationally expensive")
    print("   • May not work for complex triggers")
    print("   • Requires setting threshold")
    
    print("\n📊 Example Output:")
    print("   Class 0: loss = 0.21  ⚠️  SUSPICIOUS")
    print("   Class 1: loss = 1.45")
    print("   Class 2: loss = 1.52")
    print("   ...")
    print("   Median loss: 1.48")
    print("   MAD: 0.15")
    print("   → Class 0 flagged as backdoored (outlier)")

demonstrate_neural_cleanse()

3. Fine-Pruning

Principle: Prune neurons that are only activated by trigger

class FinePruning:
    """
    Defense by pruning neurons associated with backdoor
    """
    def __init__(self, model):
        self.model = model
    
    def identify_backdoor_neurons(self, clean_dataset, pruning_rate=0.05):
        """
        Identify neurons with low average activation on clean data
        These might be backdoor-specific neurons
        """
        self.model.eval()
        neuron_activations = {}
        
        # Hook to capture activations
        def activation_hook(name):
            def hook(module, input, output):
                if name not in neuron_activations:
                    neuron_activations[name] = []
                neuron_activations[name].append(output.detach())
            return hook
        
        # Register hooks on all layers
        hooks = []
        for name, module in self.model.named_modules():
            if isinstance(module, nn.Conv2d) or isinstance(module, nn.Linear):
                hooks.append(module.register_forward_hook(activation_hook(name)))
        
        # Forward pass on clean dataset
        with torch.no_grad():
            for idx in range(min(1000, len(clean_dataset))):
                image, _ = clean_dataset[idx]
                _ = self.model(image.unsqueeze(0))
        
        # Remove hooks
        for hook in hooks:
            hook.remove()
        
        # Calculate average activation per output unit
        # (per output channel for Conv2d, per neuron for Linear)
        avg_activations = {}
        for layer_name, activations in neuron_activations.items():
            stacked = torch.cat(activations, dim=0)  # (N, C, H, W) or (N, F)
            if stacked.dim() == 4:
                avg_activations[layer_name] = stacked.mean(dim=(0, 2, 3))
            else:
                avg_activations[layer_name] = stacked.mean(dim=0)
        
        # Identify units with lowest average activation (dormant units)
        neurons_to_prune = {}
        for layer_name, avg_act in avg_activations.items():
            num_units = avg_act.numel()
            num_prune = int(num_units * pruning_rate)
            if num_prune == 0:
                continue
            
            # Get indices of units to prune
            _, indices = torch.topk(avg_act, num_prune, largest=False)
            neurons_to_prune[layer_name] = indices
        
        return neurons_to_prune
    
    def prune_neurons(self, neurons_to_prune):
        """
        Set identified neurons' weights to zero
        """
        for layer_name, indices in neurons_to_prune.items():
            layer = dict(self.model.named_modules())[layer_name]
            
            # Zero out the incoming weights (and bias) of each pruned output unit.
            # Row i of the weight tensor corresponds to output unit i for both
            # Conv2d (out_channels, in_channels, kH, kW) and Linear (out, in).
            with torch.no_grad():
                layer.weight.data[indices] = 0
                if layer.bias is not None:
                    layer.bias.data[indices] = 0
        
        print(f"Pruned neurons from {len(neurons_to_prune)} layers")
    
    def fine_tune(self, clean_dataset, num_epochs=5):
        """
        Fine-tune model on clean data after pruning
        """
        optimizer = torch.optim.SGD(self.model.parameters(), lr=0.001, momentum=0.9)
        criterion = nn.CrossEntropyLoss()
        
        dataloader = DataLoader(clean_dataset, batch_size=64, shuffle=True)
        
        self.model.train()
        for epoch in range(num_epochs):
            for images, labels in dataloader:
                optimizer.zero_grad()
                outputs = self.model(images)
                loss = criterion(outputs, labels)
                loss.backward()
                optimizer.step()
            
            print(f"Fine-tuning epoch {epoch+1}/{num_epochs} completed")

# Demonstration
def demonstrate_fine_pruning():
    print("\n" + "=" * 60)
    print("FINE-PRUNING DEFENSE")
    print("=" * 60)
    
    print("\n🔧 How it works:")
    print("   1. Identify neurons rarely activated on clean data")
    print("   2. Prune these 'dormant' neurons (set weights to zero)")
    print("   3. Fine-tune model on clean dataset")
    print("   4. Evaluate clean accuracy and ASR")
    
    print("\n✅ Strengths:")
    print("   • Simple to implement")
    print("   • Effective against many backdoor types")
    print("   • Minimal impact on clean accuracy")
    
    print("\n❌ Limitations:")
    print("   • Requires clean validation set")
    print("   • May reduce model capacity")
    print("   • Not effective against all backdoors")
    
    print("\n📊 Typical Results:")
    print("   Before pruning:")
    print("     - Clean Accuracy: 92%")
    print("     - Attack Success Rate: 95%")
    print("   After pruning (5% neurons):")
    print("     - Clean Accuracy: 90% (-2%)")
    print("     - Attack Success Rate: 25% (-70%) ✓")

demonstrate_fine_pruning()
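
For reference, a hedged usage sketch of the FinePruning class defined above; backdoored_model and clean_valset are assumed to exist and are not defined in these notes.

# Hypothetical usage (uncomment with a real model and clean validation data):
# defender = FinePruning(backdoored_model)
# dormant = defender.identify_backdoor_neurons(clean_valset, pruning_rate=0.05)
# defender.prune_neurons(dormant)
# defender.fine_tune(clean_valset, num_epochs=5)
# Afterwards, re-run BackdoorAttackFramework.evaluate_attack() to compare clean accuracy and ASR.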

Mitigation Strategies

1. Data Sanitization

class DataSanitization:
    """
    Pre-processing defense: sanitize training data
    """
    def __init__(self):
        self.transformations = [
            transforms.GaussianBlur(kernel_size=3),
            transforms.RandomRotation(5),
            transforms.ColorJitter(brightness=0.1)
        ]
    
    def sanitize_dataset(self, dataset):
        """
        Apply random transformations to break trigger patterns
        """
        sanitized_data = []
        
        for image, label in dataset:
            # Randomly select and apply transformation
            transform = np.random.choice(self.transformations)
            sanitized_image = transform(image)
            sanitized_data.append((sanitized_image, label))
        
        return sanitized_data
    
    def detect_and_remove_outliers(self, dataset, contamination=0.1):
        """
        Use outlier detection to remove suspicious samples
        """
        from sklearn.ensemble import IsolationForest
        
        # Extract features (simplified: use pixel statistics)
        features = []
        for image, _ in dataset:
            feature = [
                torch.mean(image).item(),
                torch.std(image).item(),
                torch.max(image).item(),
                torch.min(image).item()
            ]
            features.append(feature)
        
        # Outlier detection
        clf = IsolationForest(contamination=contamination, random_state=42)
        predictions = clf.fit_predict(features)
        
        # Keep only inliers
        clean_dataset = [dataset[i] for i in range(len(dataset)) if predictions[i] == 1]
        
        print(f"Removed {len(dataset) - len(clean_dataset)} suspicious samples")
        return clean_dataset

2. Differential Privacy

class DifferentialPrivacyDefense:
    """
    Add noise during training to prevent backdoor learning
    """
    def __init__(self, noise_multiplier=1.0, max_grad_norm=1.0):
        self.noise_multiplier = noise_multiplier
        self.max_grad_norm = max_grad_norm
    
    def add_noise_to_gradients(self, model):
        """
        Add Gaussian noise to gradients during training
        """
        # Clip gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), self.max_grad_norm)
        
        # Add noise
        for param in model.parameters():
            if param.grad is not None:
                noise = torch.randn_like(param.grad) * self.noise_multiplier * self.max_grad_norm
                param.grad += noise
    
    def train_with_dp(self, model, train_loader, num_epochs=10):
        """
        Train model with differential privacy
        """
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        criterion = nn.CrossEntropyLoss()
        
        model.train()
        for epoch in range(num_epochs):
            for images, labels in train_loader:
                optimizer.zero_grad()
                outputs = model(images)
                loss = criterion(outputs, labels)
                loss.backward()
                
                # Apply DP noise
                self.add_noise_to_gradients(model)
                
                optimizer.step()

Defense Comparison

def compare_defense_methods():
    """
    Compare different defense strategies
    """
    print("\n" + "=" * 80)
    print("DEFENSE METHODS COMPARISON")
    print("=" * 80)
    
    defenses = {
        'Activation Clustering': {
            'Detection': '⭐⭐⭐⭐',
            'Mitigation': '⭐⭐⭐',
            'Computational Cost': '⭐⭐⭐⭐',
            'False Positive Rate': 'Medium',
            'Best For': 'Post-deployment detection'
        },
        'Neural Cleanse': {
            'Detection': '⭐⭐⭐⭐⭐',
            'Mitigation': '⭐⭐',
            'Computational Cost': '⭐⭐⭐⭐⭐',
            'False Positive Rate': 'Low',
            'Best For': 'Forensic analysis'
        },
        'Fine-Pruning': {
            'Detection': '⭐⭐⭐',
            'Mitigation': '⭐⭐⭐⭐',
            'Computational Cost': '⭐⭐',
            'False Positive Rate': 'Low',
            'Best For': 'Post-deployment mitigation'
        },
        'Data Sanitization': {
            'Detection': '⭐⭐',
            'Mitigation': '⭐⭐⭐',
            'Computational Cost': '⭐',
            'False Positive Rate': 'High',
            'Best For': 'Pre-training prevention'
        },
        'Differential Privacy': {
            'Detection': '⭐',
            'Mitigation': '⭐⭐⭐⭐',
            'Computational Cost': '⭐⭐⭐',
            'False Positive Rate': 'N/A',
            'Best For': 'Training-time prevention'
        }
    }
    
    print("\n| Method | Detection | Mitigation | Cost | FP Rate | Best For |")
    print("|--------|-----------|------------|------|---------|----------|")
    for method, scores in defenses.items():
        print(f"| {method:20} | {scores['Detection']:13} | {scores['Mitigation']:10} | "
              f"{scores['Computational Cost']:4} | {scores['False Positive Rate']:7} | "
              f"{scores['Best For']:20} |")
    
    print("\n💡 Recommendation: Use multiple defenses in combination")
    print("   Example Pipeline:")
    print("   1. Data Sanitization (preprocessing)")
    print("   2. Training with Differential Privacy")
    print("   3. Post-training: Neural Cleanse for detection")
    print("   4. If backdoor found: Fine-Pruning for mitigation")

compare_defense_methods()
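
The recommended pipeline printed above can be wired together from the classes in this section. The sketch below is an illustration rather than a tested recipe: it assumes a model constructor, a raw training set, and a clean validation set are supplied by the caller, and it reuses DataSanitization, DifferentialPrivacyDefense, NeuralCleanse, and FinePruning as defined earlier in this section.

def defense_in_depth(make_model, raw_trainset, clean_valset, num_classes=10):
    """Sketch of the combined pipeline: sanitize -> DP training -> detect -> prune."""
    # 1. Data sanitization (preprocessing)
    sanitizer = DataSanitization()
    cleaned = sanitizer.detect_and_remove_outliers(raw_trainset, contamination=0.1)
    cleaned = sanitizer.sanitize_dataset(cleaned)

    # 2. Training with differential privacy
    model = make_model()
    dp = DifferentialPrivacyDefense(noise_multiplier=1.0, max_grad_norm=1.0)
    train_loader = DataLoader(cleaned, batch_size=64, shuffle=True)
    dp.train_with_dp(model, train_loader, num_epochs=10)

    # 3. Post-training detection with Neural Cleanse
    cleanser = NeuralCleanse(model, num_classes=num_classes)
    suspected, outliers, _ = cleanser.detect_backdoor()

    # 4. If a backdoor is suspected, mitigate with Fine-Pruning
    if suspected or outliers:
        defender = FinePruning(model)
        dormant = defender.identify_backdoor_neurons(clean_valset, pruning_rate=0.05)
        defender.prune_neurons(dormant)
        defender.fine_tune(clean_valset, num_epochs=5)

    return model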

Summary & Next Steps

Duration: 10 minutes

Key Takeaways

📚 What We Learned Today:

1. Data Poisoning Fundamentals
   ✓ Attack taxonomy (availability vs. integrity)
   ✓ Training-time vs. inference-time attacks
   ✓ Real-world implications

2. Clean-Label vs. Dirty-Label Attacks
   ✓ Dirty-label: Easier but more detectable
   ✓ Clean-label: Stealthy using adversarial perturbations
   ✓ Trade-offs between stealth and effectiveness

3. Backdoor/Trojan Attacks
   ✓ Pixel-pattern triggers (most common)
   ✓ Physical triggers (real-world applicable)
   ✓ Semantic triggers (most sophisticated)
   ✓ Complete attack pipeline

4. Trigger Design
   ✓ Stealthiness metrics (PSNR, SSIM)
   ✓ Effectiveness (ASR > 90%)
   ✓ Robustness to transformations
   ✓ Advanced designs (dynamic, distributed, sample-specific)

5. Detection & Mitigation
   ✓ Activation Clustering
   ✓ Neural Cleanse
   ✓ Fine-Pruning
   ✓ Data Sanitization
   ✓ Differential Privacy

Critical Insights

āš ļø  Why Backdoor Attacks Are Dangerous:
   • High stealth: Maintain normal accuracy
   • Persistent: Survive training process
   • Controllable: Attacker-activated
   • Hard to detect: Require specialized techniques

🔑 Defense-in-Depth Strategy:
   • No single defense is perfect
   • Combine multiple techniques
   • Prevention > Detection > Mitigation
   • Continuous monitoring essential

Assignment Preview

šŸ“ Homework Assignment (Week 4):

Part 1: Implementation (60 points)
   □ Implement dirty-label backdoor attack on CIFAR-10
   □ Achieve ASR > 90% with clean accuracy > 85%
   □ Experiment with 3 different trigger designs

Part 2: Detection (30 points)
   □ Implement Activation Clustering detector
   □ Test on provided backdoored models
   □ Report detection accuracy

Part 3: Analysis (10 points)
   □ Write 2-page report comparing your trigger designs
   □ Discuss stealth vs. effectiveness trade-offs
   □ Propose one novel defense strategy

Deliverables:
   • Python code (Jupyter notebook)
   • Written report (PDF)
   • Demo video (< 5 min)

Due: End of Week 5

Looking Ahead: Week 5

📅 Next Week: Model Inversion, Membership Inference & Privacy Attacks

Preview:
   • Privacy threats in ML systems
   • Extracting training data from models
   • Membership inference techniques
   • Differential privacy as defense
   • Federated learning security

Preparation:
   □ Review differential privacy basics
   □ Read: "The Secret Sharer" paper
   □ Install: Opacus (PyTorch differential privacy library)

📖 Essential Reading:

Research Papers:
   1. "BadNets: Identifying Vulnerabilities in ML Model Supply Chain"
      (Gu et al., 2017) - Original backdoor paper
   
   2. "Trojaning Attack on Neural Networks"
      (Liu et al., 2018) - Trojan trigger generation and insertion
   
   3. "Neural Cleanse: Identifying and Mitigating Backdoor Attacks"
      (Wang et al., 2019) - Detection method
   
   4. "Fine-Pruning: Defending Against Backdooring Attacks"
      (Liu et al., 2018) - Defense method

Online Resources:
   • Backdoor Learning Resources List: 
     github.com/THUYimingLi/backdoor-learning-resources
   
   • TrojanZoo Framework:
     github.com/ain-soph/trojanzoo
   
   • BackdoorBox Toolkit:
     github.com/THUYimingLi/BackdoorBox

Tutorials:
   • "Backdoor Attacks and Defenses" (NeurIPS 2020 Tutorial)
   • CVPR 2021 Tutorial on Trustworthy ML

Questions & Discussion

💬 Discussion Topics:

1. Ethical Considerations
   Q: When is it appropriate to research backdoor attacks?
   Q: How can we prevent malicious use of this knowledge?

2. Real-World Applications
   Q: What industries are most vulnerable?
   Q: How can we build more secure ML systems?

3. Future Research Directions
   Q: What defenses are still needed?
   Q: How will backdoor attacks evolve with larger models?

Open Floor for Questions...

Final Checklist

✅ Before Next Class:

□ Complete Week 4 assignment
□ Review today's code examples
□ Set up environment for privacy attacks (Week 5)
□ Read at least 1 recommended paper
□ Post questions on Canvas discussion board

Optional Challenge:
□ Try implementing a physical patch trigger
□ Test your backdoor against different defenses
□ Share interesting findings on course forum

Additional Code Resources

Complete Training Example

def train_backdoored_model(
    clean_dataset,
    backdoor_attack,
    num_epochs=20,
    batch_size=64
):
    """
    Complete example of training a backdoored model
    
    This function demonstrates the entire attack pipeline
    """
    print("="*60)
    print("TRAINING BACKDOORED MODEL")
    print("="*60)
    
    # 1. Create backdoored dataset
    print("\n[1] Creating backdoored dataset...")
    backdoored_dataset = backdoor_attack.create_backdoored_dataset(clean_dataset)
    train_loader = DataLoader(backdoored_dataset, batch_size=batch_size, shuffle=True)
    
    # 2. Initialize model
    print("[2] Initializing model...")
    from torchvision.models import resnet18
    model = resnet18(num_classes=10)
    
    # 3. Training
    print("[3] Training model...")
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss()
    
    model.train()
    for epoch in range(num_epochs):
        epoch_loss = 0
        for images, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss/len(train_loader):.4f}")
    
    # 4. Evaluation
    print("\n[4] Evaluating model...")
    clean_acc, asr = backdoor_attack.evaluate_attack(model, clean_dataset)
    print(f"Clean Accuracy: {clean_acc*100:.2f}%")
    print(f"Attack Success Rate: {asr*100:.2f}%")
    
    return model

# Usage example (uncomment to run):
# attack = BackdoorAttackFramework(trigger_type='pixel', target_class=0)
# model = train_backdoored_model(trainset, attack)

Appendix: Mathematical Foundations

A.1 Backdoor Attack Formalization

Given:

  • Training dataset: D = {(x_i, y_i)}_{i=1}^N
  • Backdoor trigger: t
  • Target class: y_target
  • Poisoning rate: α

Backdoored dataset:

D_backdoor = D_clean ∪ D_poison

where D_poison = {(x_i + t, y_target) | i ∈ S}
and S is a random subset of indices with |S| = α·N

Attack objectives:

  1. Utility: acc(f, D_clean) ≄ acc_threshold
  2. Effectiveness: acc(f, D_trigger) ≄ ASR_threshold

A.2 Defense Effectiveness Metrics

For detection:

True Positive Rate (TPR) = Detected Backdoors / Total Backdoors
False Positive Rate (FPR) = False Alarms / Total Clean Samples

Detection Score = TPR - FPR

For mitigation:

Clean Accuracy Drop = acc_before - acc_after
ASR Reduction = ASR_before - ASR_after

Mitigation Score = ASR_Reduction / Clean_Accuracy_Drop
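
As a worked example, using the illustrative fine-pruning numbers from earlier (clean accuracy 92% -> 90%, ASR 95% -> 25%):

Clean Accuracy Drop = 92 - 90 = 2 points
ASR Reduction = 95 - 25 = 70 points
Mitigation Score = 70 / 2 = 35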

End of Week 4 Tutorial

Remember: The goal is to understand both offense and defense to build more secure ML systems!