Week 3: Evasion Attacks & Adversarial Examples

CSCI 5773 - Introduction to Emerging Systems Security

Duration: 140-150 minutes
Module: Adversarial Machine Learning
Instructor: Dr. Zhengxiong Li


Table of Contents

  1. Introduction & Motivation (15 min)
  2. Fundamentals of Adversarial Machine Learning (20 min)
  3. White-box vs. Black-box Attacks (15 min)
  4. Attack Methods: FGSM, PGD, C&W (50 min)
  5. Transferability of Adversarial Examples (20 min)
  6. Physical Adversarial Examples (20 min)
  7. Wrap-up & Discussion (10 min)

1. Introduction & Motivation

Duration: 15 minutes

1.1 The Adversarial Machine Learning Problem

Opening Question: What happens when a machine learning model sees something it wasn't designed to handle?

Imagine a self-driving car's vision system that classifies a stop sign as a speed limit sign because someone placed a few carefully designed stickers on it. Or a spam filter that fails to detect malicious emails because attackers subtly modified their wording. These are examples of adversarial attacks on machine learning systems.

1.2 Historical Context

The Szegedy et al. Discovery (2013)

  • Researchers at Google discovered that adding imperceptible perturbations to images could fool neural networks
  • The now-famous illustration (from the follow-up paper by Goodfellow et al., 2014): a panda image + carefully crafted noise = classified as "gibbon" with 99.3% confidence
  • The perturbations are so small that humans cannot detect them
  • Key insight: Neural networks are vulnerable to inputs specifically designed to exploit their decision boundaries

1.3 Why This Matters

Real-World Security Implications:

  1. Autonomous Vehicles
    • Misclassifying traffic signs, pedestrians, or obstacles
    • Potential for accidents or unauthorized access
  2. Biometric Authentication
    • Fooling face recognition systems
    • Bypassing fingerprint or iris scanners
  3. Malware Detection
    • Evading antivirus and intrusion detection systems
    • Polymorphic malware that adapts to avoid detection
  4. Content Moderation
    • Bypassing filters for hate speech, misinformation, or illegal content
    • Automated censorship evasion
  5. Medical Diagnosis
    • Misclassifying medical images (X-rays, MRIs)
    • Potential for incorrect diagnoses and treatments

1.4 Learning Objectives for Today

By the end of this session, you will be able to:

  • ✅ Understand the fundamental concepts of adversarial machine learning
  • ✅ Distinguish between white-box and black-box attack scenarios
  • ✅ Implement basic adversarial attacks (FGSM, PGD, C&W)
  • ✅ Evaluate the transferability of adversarial examples across models
  • ✅ Recognize the challenges of physical-world adversarial attacks
  • ✅ Assess model robustness against adversarial inputs

2. Fundamentals of Adversarial Machine Learning

Duration: 20 minutes

2.1 What is an Adversarial Example?

Definition: An adversarial example is a specially crafted input designed to cause a machine learning model to make a mistake, typically by adding small, carefully calculated perturbations to legitimate inputs.

Mathematical Formulation:

Given:

  • A classifier function: f(x) = y
  • An original input: x with true label y_true
  • A perturbation: δ (delta)

An adversarial example is: x_adv = x + δ

Where:

  • f(x_adv) ≠ y_true (misclassification occurs)
  • ||δ|| < ε (perturbation is small, typically constrained by epsilon)
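
A minimal sketch of this definition in code (the names model, x, x_adv, and y_true are placeholders, not part of the course codebase), checking both conditions under an L∞ budget:

import torch

def is_adversarial(model, x, x_adv, y_true, epsilon):
    """Check the two defining conditions: misclassification and a small perturbation."""
    with torch.no_grad():
        pred = model(x_adv.unsqueeze(0)).argmax(dim=1).item()
    delta_inf = (x_adv - x).abs().max().item()     # ||delta||_inf
    return (pred != y_true) and (delta_inf <= epsilon)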

2.2 Types of Adversarial Attacks

By Attack Goal:

  1. Untargeted Attacks
    • Goal: Cause any misclassification
    • Example: Make a "cat" image classified as anything except "cat"
    • Easier to achieve
  2. Targeted Attacks
    • Goal: Cause misclassification to a specific target class
    • Example: Make a "cat" image classified specifically as "dog"
    • More challenging, requires more sophisticated perturbations

By Perturbation Constraints:

  1. L∞ (L-infinity) Norm
    • Limits maximum change to any single pixel
    • ||δ||∞ ≤ ε means no pixel changes by more than ε
    • Most commonly used in research
    • Example: ε = 0.03 on a [0, 1] pixel scale
  2. L2 (Euclidean) Norm
    • Limits total perturbation energy
    • ||δ||₂ ≤ ε constrains the Euclidean distance
    • Better represents overall image distortion
  3. L0 Norm
    • Limits number of pixels that can be changed
    • Sparse perturbations
    • Example: Modify only 10% of pixels
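
Each of these norms can be computed directly from the perturbation tensor. A minimal sketch (the perturbation values here are random and purely illustrative):

import torch

delta = torch.randn(3, 32, 32) * 0.01      # illustrative perturbation for a 32x32 RGB image

linf = delta.abs().max()                    # L-infinity: largest change to any single value
l2 = delta.flatten().norm(p=2)              # L2: Euclidean length of the perturbation
l0 = (delta != 0).sum()                     # L0: number of modified values

print(f"L-inf = {linf:.4f}, L2 = {l2:.4f}, L0 = {l0}")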

2.3 The Threat Model

Understanding adversarial attacks requires defining the threat model - what the attacker knows and can do.

Key Dimensions:

  1. Knowledge of the Model
    • White-box: Full access to model architecture, weights, and training data
    • Black-box: Only query access (input/output)
    • Gray-box: Partial knowledge (e.g., architecture but not weights)
  2. Attack Capability
    • Test-time attacks: Modify inputs at inference
    • Training-time attacks: Poison training data (covered in Week 4)
  3. Physical Constraints
    • Digital attacks: Direct pixel manipulation
    • Physical attacks: Real-world modifications (stickers, lighting, etc.)

2.4 Why Are Neural Networks Vulnerable?

Key Factors:

  1. High Dimensionality
    • Images have thousands/millions of dimensions
    • Small changes in many dimensions accumulate
  2. Linear Nature
    • Despite non-linear activations, many models are locally linear
    • Small perturbations propagate linearly through layers (see the short calculation after this list)
  3. Overconfidence
    • Models make confident predictions even far from training distribution
    • No built-in uncertainty quantification
  4. Decision Boundary Proximity
    • Natural images often lie close to decision boundaries
    • Easy to push them across with small perturbations
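
A one-line version of the linearity argument (factors 1 and 2 above), following Goodfellow et al. (2014): for a linear score w · x, the FGSM-style perturbation δ = ε · sign(w) shifts the score by

w · x_adv = w · (x + δ) = w · x + ε · ||w||₁

With n input dimensions and average weight magnitude m, the shift is roughly ε · m · n, so an ε too small to see can still move the score by a large amount simply because n is large.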

Visual Analogy:

Normal Image Space:
  [Cat Region] | [Dog Region] | [Car Region]
       x       |              |
      (cat)    |   boundary   |

Adversarial Example:
  [Cat Region] | [Dog Region] | [Car Region]
       x --δ-->|  x_adv       |
      (cat)    | (dog)        |

The perturbation δ pushes the input across the decision boundary.


3. White-box vs. Black-box Attacks

Duration: 15 minutes

3.1 White-box Attacks

Definition: The attacker has complete knowledge of the target model.

Attacker's Knowledge:

  • ✅ Model architecture (layers, activations, etc.)
  • ✅ Model parameters (weights and biases)
  • ✅ Training procedure and hyperparameters
  • ✅ Training data distribution (sometimes)
  • ✅ Gradient information

Advantages:

  • Can compute exact gradients
  • Most effective attacks possible
  • Theoretical worst-case scenario
  • Useful for robustness testing

Attack Strategy:

  • Use gradient-based optimization
  • Leverage backpropagation
  • Direct computation of optimal perturbations

Example Scenario:

Researcher testing their own model for vulnerabilities
└─> Full access to model internals
    └─> Can compute: ∇_x L(f(x), y_target)

3.2 Black-box Attacks

Definition: The attacker has no knowledge of model internals, only query access.

Attacker's Knowledge:

  • ✅ Input format
  • ✅ Output format (labels, probabilities)
  • ❌ Model architecture
  • ❌ Model parameters
  • ❌ Gradient information

Attack Approaches:

  1. Query-based Attacks
    • Submit inputs and observe outputs
    • Estimate gradients through finite differences
    • Requires many queries (can be expensive/detectable)
  2. Transfer-based Attacks
    • Train a substitute model on similar data
    • Generate adversarial examples on substitute
    • Transfer them to target model
    • Exploits transferability property
  3. Decision-based Attacks
    • Only observe final decisions (no probabilities)
    • Use boundary-following techniques
    • Requires even more queries

Example Scenario:

Attacker targeting a cloud ML API
└─> Can only send images and receive predictions
    └─> Must estimate gradients or use transferability
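
To make the query-based approach concrete, here is a minimal sketch of zeroth-order (finite-difference) gradient estimation. The query_loss function stands in for the victim API and is a placeholder; practical attacks estimate only a few random directions per step, because two queries per coordinate is prohibitively expensive for images.

import torch

def estimate_gradient(query_loss, x, sigma=1e-3):
    """
    Estimate d(loss)/dx using only forward queries to a black-box model.

    query_loss(x) -> scalar loss value returned by the victim (placeholder).
    Costs two queries per input coordinate.
    """
    flat = x.flatten()
    grad = torch.zeros_like(flat)
    for i in range(flat.numel()):
        e = torch.zeros_like(flat)
        e[i] = sigma
        loss_plus = query_loss((flat + e).view_as(x))
        loss_minus = query_loss((flat - e).view_as(x))
        grad[i] = (loss_plus - loss_minus) / (2 * sigma)
    return grad.view_as(x)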

3.3 Comparison Table

| Aspect              | White-box                       | Black-box              |
|---------------------|---------------------------------|------------------------|
| Model Access        | Complete                        | Query only             |
| Gradient Info       | Available                       | Must estimate          |
| Attack Success Rate | Highest                         | Lower                  |
| Queries Required    | Few                             | Many                   |
| Realism             | Lower (rarely have full access) | Higher                 |
| Detection Risk      | Lower                           | Higher (many queries)  |
| Computation         | Efficient                       | Expensive              |

3.4 Gray-box Attacks (Intermediate)

Definition: Partial knowledge of the model.

Common Scenarios:

  • Know architecture but not weights (e.g., public model types)
  • Have access to similar models from same vendor
  • Know training methodology but not exact data

Strategy:

  • Use architecture knowledge to build substitute
  • Fine-tune on available data
  • Apply transfer attacks

4. Attack Methods: FGSM, PGD, C&W

Duration: 50 minutes

4.1 Fast Gradient Sign Method (FGSM)

Duration: 15 minutes

Developed by: Ian Goodfellow et al. (2014)
Type: White-box, single-step attack
Complexity: Low (fast and simple)

The Core Idea

FGSM linearizes the loss function around the current input and takes a single step in the direction that maximizes loss.

Mathematical Formulation:

x_adv = x + ε · sign(∇_x L(f(x), y_true))

Where:

  • x: Original input
  • ε: Perturbation magnitude (typically 0.01 to 0.3)
  • ∇_x L(f(x), y_true): Gradient of loss w.r.t. input
  • sign(): Takes the sign of each gradient component (+1, 0, or -1)
  • y_true: True label (for untargeted) or target label (for targeted)

Intuition

Think of the loss function as a hill:

  • For untargeted attacks: Climb the hill (increase loss) → misclassification
  • For targeted attacks: Go downhill toward target class

The sign() function ensures we move the same distance (ε) in each dimension, regardless of gradient magnitude.
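
A tiny numeric illustration of that point (the gradient values are made up): components with very different magnitudes all end up moving by exactly ε.

import torch

grad = torch.tensor([0.8, -0.002, 0.0])    # hypothetical gradient components
epsilon = 0.03

step = epsilon * grad.sign()                # every nonzero component moves by exactly epsilon
print(step)                                 # tensor([ 0.0300, -0.0300,  0.0000])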

Step-by-Step Process

  1. Forward pass: Compute model prediction on original input
  2. Compute loss: Calculate loss w.r.t. true/target label
  3. Backward pass: Compute gradient of loss w.r.t. input
  4. Generate perturbation: Take sign of gradient, multiply by ε
  5. Create adversarial example: Add perturbation to original input
  6. Clip values: Ensure pixels remain in the valid range [0, 1]

Code Demo: FGSM Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np

# Load a pre-trained ResNet-18 (torchvision's weights are trained on ImageNet;
# for CIFAR-10 you would fine-tune or load CIFAR-trained weights instead)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torchvision.models.resnet18(pretrained=True)
model = model.to(device)
model.eval()

def fgsm_attack(image, epsilon, data_grad):
    """
    FGSM attack implementation
    
    Args:
        image: Original input image (tensor)
        epsilon: Perturbation magnitude
        data_grad: Gradient of loss w.r.t. input
    
    Returns:
        Perturbed image
    """
    # Collect the sign of the data gradient
    sign_data_grad = data_grad.sign()
    
    # Create the perturbed image
    perturbed_image = image + epsilon * sign_data_grad
    
    # Clip to maintain valid pixel range [0, 1]
    perturbed_image = torch.clamp(perturbed_image, 0, 1)
    
    return perturbed_image

def test_fgsm(model, device, test_loader, epsilon):
    """
    Test FGSM attack on a dataset
    """
    correct = 0
    adv_examples = []
    
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        
        # Set requires_grad to true for gradient computation
        data.requires_grad = True
        
        # Forward pass
        output = model(data)
        init_pred = output.max(1, keepdim=True)[1]
        
        # Skip if initially incorrect
        if init_pred.item() != target.item():
            continue
        
        # Calculate loss
        loss = F.cross_entropy(output, target)
        
        # Zero gradients
        model.zero_grad()
        
        # Backward pass
        loss.backward()
        
        # Collect gradient
        data_grad = data.grad.data
        
        # Generate adversarial example
        perturbed_data = fgsm_attack(data, epsilon, data_grad)
        
        # Re-classify
        output = model(perturbed_data)
        final_pred = output.max(1, keepdim=True)[1]
        
        if final_pred.item() == target.item():
            correct += 1
        else:
            # Save adversarial example for visualization
            if len(adv_examples) < 5:
                adv_ex = perturbed_data.squeeze().detach().cpu().numpy()
                adv_examples.append((init_pred.item(), final_pred.item(), adv_ex))
    
    # Calculate accuracy
    accuracy = correct / len(test_loader)
    print(f"Epsilon: {epsilon}\tAccuracy: {accuracy:.2f}")
    
    return accuracy, adv_examples

# Example usage (assumes a test_loader with batch_size=1 has been defined)
epsilons = [0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3]
accuracies = []

for eps in epsilons:
    acc, ex = test_fgsm(model, device, test_loader, eps)
    accuracies.append(acc)

# Plot accuracy vs epsilon
plt.figure(figsize=(8, 6))
plt.plot(epsilons, accuracies, "*-")
plt.xlabel("Epsilon")
plt.ylabel("Accuracy")
plt.title("Model Accuracy vs. FGSM Epsilon")
plt.show()

Strengths and Limitations

Strengths:

  • ✅ Extremely fast (single gradient computation)
  • ✅ Simple to implement
  • ✅ Good for testing baseline robustness
  • ✅ Works well for small ε values

Limitations:

  • ❌ Single-step method (not optimal)
  • ❌ Lower success rate than iterative methods
  • ❌ Can be easily defended against
  • ❌ Limited transferability

Targeted FGSM Variant

For targeted attacks (forcing classification to specific target):

def fgsm_targeted(image, epsilon, data_grad):
    """
    Targeted FGSM - minimize loss for target class
    """
    # Note the negative sign (gradient descent instead of ascent)
    sign_data_grad = data_grad.sign()
    perturbed_image = image - epsilon * sign_data_grad
    perturbed_image = torch.clamp(perturbed_image, 0, 1)
    return perturbed_image

4.2 Projected Gradient Descent (PGD)

Duration: 20 minutes

Developed by: Madry et al. (2017)
Type: White-box, iterative attack
Complexity: Medium (multiple iterations)

The Core Idea

PGD is an iterative version of FGSM that takes multiple smaller steps and projects back onto the allowed perturbation space. It's considered one of the strongest first-order adversarial attacks.

Why Iterative?

  • Single-step FGSM is suboptimal
  • Multiple small steps find better adversarial examples
  • Can escape local minima
  • Achieves higher attack success rates

Mathematical Formulation:

x₀ = x + uniform_noise(-ε, ε)  # Random initialization

For t = 0 to T-1:
    x_{t+1} = Π_{x+S}(x_t + α · sign(∇_x L(f(x_t), y)))

Where:

  • T: Number of iterations (typically 10-100)
  • α: Step size (typically ε/T or smaller)
  • Π_{x+S}: Projection operator that clips to allowed space
  • S: Allowed perturbation region (L∞ ball of radius ε)
  • Random initialization helps escape local minima

Key Components

  1. Random Start
    • Initialize within ε-ball around original input
    • Helps find stronger adversarial examples
    • Prevents getting stuck in local optima
  2. Iterative Updates
    • Take multiple gradient steps
    • Each step smaller than FGSM (α << ε)
    • More thorough exploration of loss landscape
  3. Projection
    • After each step, project back to allowed region
    • Ensures ||x_adv - x||∞ ≤ ε
    • Maintains perturbation constraint

Projection Operator Explained

The projection ensures we stay within the ε-ball:

def project(x, x_orig, epsilon):
    """
    Project x onto L-infinity ball around x_orig with radius epsilon
    """
    # Clip perturbation to [-epsilon, epsilon]
    delta = torch.clamp(x - x_orig, -epsilon, epsilon)
    
    # Add back to original
    x_proj = x_orig + delta
    
    # Clip to valid pixel range [0, 1]
    x_proj = torch.clamp(x_proj, 0, 1)
    
    return x_proj

Code Demo: PGD Implementation

def pgd_attack(model, images, labels, epsilon=0.3, alpha=0.01, num_iter=40, random_start=True):
    """
    Projected Gradient Descent attack
    
    Args:
        model: Target model
        images: Input images
        labels: True labels (for untargeted) or target labels (for targeted)
        epsilon: Maximum perturbation (L-infinity norm)
        alpha: Step size per iteration
        num_iter: Number of iterations
        random_start: Whether to use random initialization
    
    Returns:
        Adversarial examples
    """
    images = images.clone().detach().to(device)
    labels = labels.clone().detach().to(device)
    
    # Random initialization within epsilon ball
    if random_start:
        delta = torch.empty_like(images).uniform_(-epsilon, epsilon)
        delta = torch.clamp(delta, 0-images, 1-images)  # Keep in valid range
        adv_images = images + delta
    else:
        adv_images = images.clone()
    
    adv_images.requires_grad = True
    
    for i in range(num_iter):
        # Forward pass
        outputs = model(adv_images)
        
        # Calculate loss
        loss = F.cross_entropy(outputs, labels)
        
        # Backward pass
        model.zero_grad()
        loss.backward()
        
        # Get gradient
        grad = adv_images.grad.data
        
        # Update adversarial images
        adv_images = adv_images.detach() + alpha * grad.sign()
        
        # Project back to epsilon ball
        delta = torch.clamp(adv_images - images, -epsilon, epsilon)
        adv_images = images + delta
        
        # Clip to valid pixel range
        adv_images = torch.clamp(adv_images, 0, 1)
        
        adv_images.requires_grad = True
    
    return adv_images.detach()

# Example usage with varying parameters
def test_pgd_variants():
    """
    Compare PGD with different parameters
    """
    test_image, test_label = next(iter(test_loader))
    test_image, test_label = test_image.to(device), test_label.to(device)
    
    configs = [
        {'num_iter': 10, 'alpha': 0.03, 'name': 'PGD-10'},
        {'num_iter': 40, 'alpha': 0.01, 'name': 'PGD-40'},
        {'num_iter': 100, 'alpha': 0.003, 'name': 'PGD-100'},
    ]
    
    epsilon = 0.3
    
    for config in configs:
        adv_images = pgd_attack(
            model, test_image, test_label,
            epsilon=epsilon,
            alpha=config['alpha'],
            num_iter=config['num_iter']
        )
        
        # Evaluate
        with torch.no_grad():
            orig_output = model(test_image)
            adv_output = model(adv_images)
            
            orig_pred = orig_output.argmax(1)
            adv_pred = adv_output.argmax(1)
            
            success = (orig_pred != adv_pred).sum().item()
            
        print(f"{config['name']}: Success Rate = {success}/{len(test_label)}")
        
        # Visualize perturbation
        perturbation = (adv_images - test_image).abs()
        print(f"  L∞ norm: {perturbation.max():.4f}")
        print(f"  L2 norm: {perturbation.norm():.4f}")

test_pgd_variants()

PGD Variants

  1. PGD-∞ (L-infinity constrained)
    • What we've described above
    • Most common in research
    • Constrains maximum per-pixel change
  2. PGD-2 (L2 constrained)
    • Constrains total Euclidean distance
    • Different projection operator
    def project_l2(x, x_orig, epsilon):
        delta = x - x_orig
        delta_norm = delta.norm(p=2)
        if delta_norm > epsilon:
            delta = delta * (epsilon / delta_norm)
        return x_orig + delta
    
  3. PGD with Momentum
    • Accumulates gradient history
    • Helps escape local minima
    • Similar to momentum in optimization
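
A minimal sketch of the momentum variant (in the style of MI-FGSM, Dong et al. 2018), following the same conventions as the pgd_attack function above; the decay factor mu = 1.0 is a common choice, not a fixed standard.

import torch
import torch.nn.functional as F

def pgd_momentum_attack(model, images, labels, epsilon=0.3, alpha=0.01, num_iter=40, mu=1.0):
    """PGD that accumulates L1-normalized gradients into a momentum buffer before taking the sign step."""
    images = images.clone().detach()
    labels = labels.clone().detach()
    
    adv_images = images.clone()
    momentum = torch.zeros_like(images)
    
    for _ in range(num_iter):
        adv_images.requires_grad = True
        loss = F.cross_entropy(model(adv_images), labels)
        grad = torch.autograd.grad(loss, adv_images)[0]
        
        # Normalize by the per-image L1 norm, then accumulate into the momentum buffer
        grad = grad / grad.abs().sum(dim=[1, 2, 3], keepdim=True).clamp(min=1e-12)
        momentum = mu * momentum + grad
        
        # Step in the sign direction of the accumulated momentum, then project
        adv_images = adv_images.detach() + alpha * momentum.sign()
        delta = torch.clamp(adv_images - images, -epsilon, epsilon)
        adv_images = torch.clamp(images + delta, 0, 1)
    
    return adv_images.detach()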

PGD as Universal Attack Standard

PGD is often considered the gold standard for evaluating adversarial robustness:

  • Strong enough to find most vulnerabilities
  • Computationally tractable
  • Well-understood theoretically
  • Used in adversarial training (defense method)

Trade-offs:

| Aspect                  | FGSM               | PGD                    |
|-------------------------|--------------------|------------------------|
| Speed                   | Very fast (1 iter) | Slower (40-100 iters)  |
| Success Rate            | Moderate           | High                   |
| Perturbation Efficiency | Lower              | Higher                 |
| Use Case                | Quick testing      | Thorough evaluation    |

4.3 Carlini & Wagner (C&W) Attack

Duration: 15 minutes

Developed by: Nicholas Carlini and David Wagner (2017)
Type: White-box, optimization-based attack
Complexity: High (but very effective)

The Core Idea

C&W reformulates adversarial example generation as an optimization problem with a carefully designed loss function. Instead of following the gradient of the standard classification loss, C&W optimizes a custom objective that balances two goals:

  1. Misclassification (attack success)
  2. Perturbation minimization (imperceptibility)

This produces minimal perturbations that reliably fool the model.

Mathematical Formulation

Optimization Problem:

minimize: ||δ||_p + c · f(x + δ)

subject to: x + δ ∈ [0, 1]^n

Where:

  • δ: Perturbation to be optimized
  • ||δ||_p: Perturbation magnitude (L0, L2, or L∞)
  • c: Confidence parameter (balances two objectives)
  • f(): Objective function measuring attack success

The f() Function:

Instead of directly using classification loss, C&W introduces:

f(x') = max(max{Z(x')_i : i ≠ t} - Z(x')_t, -κ)

Where:

  • Z(x'): Logits (pre-softmax outputs) for input x'
  • t: Target class
  • κ (kappa): Confidence parameter
  • max{Z(x')_i : i ≠ t}: Highest logit for non-target classes

Intuition:

  • When Z(x')_t exceeds every other logit, the attack succeeds; the clamp at -κ stops the optimization once the target wins by a margin of κ
  • κ controls the confidence margin (how strongly we want the target class to win)
  • c balances attack success against perturbation size
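
A tiny numeric check of f() with made-up logits for a 4-class model and target class t = 2:

import torch

Z = torch.tensor([2.0, 0.5, 3.1, 1.0])    # hypothetical logits Z(x')
t, kappa = 2, 0.0

max_other = Z[torch.arange(len(Z)) != t].max()      # highest non-target logit: 2.0
f_val = torch.clamp(max_other - Z[t], min=-kappa)   # max(max_other - Z_t, -kappa)
print(f_val)                                        # tensor(0.): the target already wins, so f contributes no loss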

Why C&W is Different

Advantages over FGSM/PGD:

  1. Minimal Perturbations
    • Finds smallest possible perturbation
    • More realistic threat model
    • Harder to detect
  2. High Success Rate
    • Near 100% success on many models
    • Works even against defensive distillation
    • Very hard to defend against
  3. Confidence Control
    • Can specify confidence of misclassification
    • κ parameter ensures robust adversarial examples
  4. Different Norms
    • L0: Minimizes number of changed pixels (sparse)
    • L2: Minimizes Euclidean distance (most common)
    • L∞: Minimizes maximum per-pixel change

Disadvantages:

  1. Computational Cost
    • Much slower than FGSM/PGD
    • Requires solving optimization problem per example
    • Can take seconds to minutes per image
  2. Complexity
    • More parameters to tune (c, κ, learning rate)
    • Requires careful initialization
    • Binary search for optimal c

Change of Variables Trick

To enforce x + δ ∈ [0, 1], C&W uses a clever change of variables:

x_adv = 0.5(tanh(w) + 1)

Where w is the optimization variable. This ensures:

  • tanh(w) ∈ [-1, 1]
  • x_adv ∈ [0, 1] automatically
  • No need for explicit clipping

Then:

δ = x_adv - x = 0.5(tanh(w) + 1) - x

Code Demo: C&W L2 Attack

def cw_l2_attack(model, images, labels, targeted=True, c=1, kappa=0, max_iter=1000, learning_rate=0.01):
    """
    Carlini & Wagner L2 attack
    
    Args:
        model: Target model
        images: Input images
        labels: Target labels (for targeted attack)
        c: Weight of attack loss
        kappa: Confidence parameter
        max_iter: Maximum optimization iterations
        learning_rate: Learning rate for optimizer
    
    Returns:
        Adversarial examples
    """
    images = images.clone().detach().to(device)
    labels = labels.clone().detach().to(device)
    
    # Initialize w (will be optimized)
    # tanh^-1(2*x - 1) maps [0,1] to real numbers
    w = torch.arctanh((2 * images - 1) * 0.999)  # 0.999 to avoid infinity
    w.requires_grad = True
    
    # Optimizer
    optimizer = torch.optim.Adam([w], lr=learning_rate)
    
    best_adv = images.clone()
    best_l2 = float('inf') * torch.ones(images.shape[0]).to(device)
    
    for iteration in range(max_iter):
        optimizer.zero_grad()
        
        # Convert w to adversarial example
        adv_images = 0.5 * (torch.tanh(w) + 1)
        
        # Get model output (logits)
        logits = model(adv_images)
        
        # L2 distance
        l2_dist = (adv_images - images).pow(2).sum(dim=[1,2,3]).sqrt()
        
        # Attack loss
        if targeted:
            # For targeted: we want target_logit - max_other_logit > kappa
            target_logits = logits[range(len(labels)), labels]
            other_logits = logits.clone()
            other_logits[range(len(labels)), labels] = -float('inf')
            max_other_logits = other_logits.max(1)[0]
            
            f_loss = torch.clamp(max_other_logits - target_logits + kappa, min=0)
        else:
            # For untargeted: we want max_other_logit - true_logit > kappa
            true_logits = logits[range(len(labels)), labels]
            other_logits = logits.clone()
            other_logits[range(len(labels)), labels] = -float('inf')
            max_other_logits = other_logits.max(1)[0]
            
            f_loss = torch.clamp(true_logits - max_other_logits + kappa, min=0)
        
        # Total loss: L2 + c * attack_loss
        loss = l2_dist.sum() + c * f_loss.sum()
        
        # Backpropagation
        loss.backward()
        optimizer.step()
        
        # Update best adversarial examples
        pred_labels = logits.argmax(1)
        
        if targeted:
            successful = (pred_labels == labels)
        else:
            successful = (pred_labels != labels)
        
        for i in range(len(images)):
            if successful[i] and l2_dist[i] < best_l2[i]:
                # Detach so the stored best examples do not keep the autograd graph alive
                best_l2[i] = l2_dist[i].detach()
                best_adv[i] = adv_images[i].detach()
        
        # Print progress
        if iteration % 100 == 0:
            success_rate = successful.float().mean()
            avg_l2 = l2_dist[successful].mean() if successful.any() else float('inf')
            print(f"Iter {iteration}: Success={success_rate:.2%}, Avg L2={avg_l2:.4f}")
    
    return best_adv

# Example usage with binary search for c
def cw_attack_with_binary_search(model, images, labels, targeted=True):
    """
    C&W attack with binary search for optimal c
    """
    # Binary search parameters
    c_low = 0
    c_high = 1
    num_binary_search = 9
    
    best_adv = images.clone()
    best_l2 = float('inf') * torch.ones(images.shape[0]).to(device)  # keep on the same device as l2_dist
    
    for search_iter in range(num_binary_search):
        c = (c_low + c_high) / 2
        
        print(f"\n=== Binary Search Iteration {search_iter}, c={c:.4f} ===")
        
        adv_images = cw_l2_attack(
            model, images, labels,
            targeted=targeted,
            c=c,
            max_iter=1000
        )
        
        # Check success
        with torch.no_grad():
            logits = model(adv_images)
            pred = logits.argmax(1)
            
            if targeted:
                successful = (pred == labels)
            else:
                successful = (pred != labels)
            
            l2_dist = (adv_images - images).pow(2).sum(dim=[1,2,3]).sqrt()
        
        # Update binary search bounds
        if successful.all():
            c_high = c
            # Update best examples
            for i in range(len(images)):
                if l2_dist[i] < best_l2[i]:
                    best_l2[i] = l2_dist[i]
                    best_adv[i] = adv_images[i]
        else:
            c_low = c
    
    print(f"\nFinal Results:")
    print(f"Average L2 distance: {best_l2.mean():.4f}")
    
    return best_adv

# Test C&W
test_images, test_labels = next(iter(test_loader))
test_images = test_images.to(device)

# For targeted attack, choose random target labels
target_labels = torch.randint(0, 10, (len(test_labels),)).to(device)

adv_images = cw_attack_with_binary_search(
    model, test_images, target_labels, targeted=True
)

C&W Attack Variants

1. C&W L0 (Sparse Perturbations)

  • Minimizes number of pixels changed
  • Uses iterative pixel selection
  • Useful for understanding minimal attack requirements

2. C&W L2 (Most Common)

  • Minimizes Euclidean distance
  • Balanced imperceptibility
  • What we implemented above

3. C&W L∞ (Max Change)

  • Minimizes maximum per-pixel change
  • Comparable to PGD but more optimized
  • Uses different optimization strategy

Comparison Summary

| Attack | Speed     | Success Rate | Perturbation Size | Use Case            |
|--------|-----------|--------------|-------------------|---------------------|
| FGSM   | Very fast | Moderate     | Large             | Quick testing       |
| PGD    | Fast      | High         | Medium            | Standard evaluation |
| C&W    | Slow      | Very high    | Minimal           | Best-case attack    |

When to Use Each:

  • FGSM: Initial robustness testing, computational constraints
  • PGD: Standard adversarial training and evaluation
  • C&W: Publication-quality attacks, minimal perturbations, breaking defenses

5. Transferability of Adversarial Examples

Duration: 20 minutes

5.1 The Transferability Phenomenon

Definition: An adversarial example crafted for one model often transfers to other models, even if they have different architectures or were trained on different data.

This is surprising because:

  • Models have different architectures
  • Different random initializations
  • Different training procedures
  • Yet they share similar vulnerabilities

Discovery: Szegedy et al. (2013) first observed that adversarial examples generated for one neural network often fool other networks.

5.2 Why Does Transferability Occur?

Hypotheses:

  1. Shared Decision Boundaries
    • Different models learn similar decision boundaries
    • All models try to approximate the same underlying data distribution
    • Adversarial examples exploit geometry of the data manifold
  2. Linear Approximation
    • Models are locally linear in high dimensions
    • Similar linear approximations across models
    • Perturbations that fool one linear region transfer to others
  3. Gradient Masking
    • Some defenses hide gradients without fixing vulnerabilities
    • Adversarial examples still transfer despite obfuscated gradients
    • Reveals that defense is incomplete
  4. Shared Training Data
    • Models trained on similar data learn similar features
    • Common vulnerabilities in learned representations
    • Transfer more likely between models from same domain

5.3 Factors Affecting Transferability

Model Similarity

High Transfer Probability:

  • Same architecture family (e.g., ResNet-18 → ResNet-50)
  • Similar training data
  • Similar preprocessing
  • Same task/domain

Low Transfer Probability:

  • Very different architectures (CNN → Transformer)
  • Different tasks (ImageNet → medical images)
  • Different modalities (vision → audio)

Attack Method

Transfer Success Rates (Typical):

| Attack Method  | Same Architecture | Different Architecture |
|----------------|-------------------|------------------------|
| FGSM           | ~60%              | ~30%                   |
| PGD (10 iter)  | ~80%              | ~50%                   |
| PGD (100 iter) | ~90%              | ~60%                   |
| C&W            | ~95%              | ~70%                   |

Observations:

  • Stronger attacks (more optimization) transfer better
  • Iterative methods > single-step methods
  • Ensemble attacks transfer best

5.4 Ensemble-based Attacks

Strategy: Generate adversarial examples that fool multiple models simultaneously.

Algorithm:

1. Train/collect N different models: {M₁, M₂, ..., Mₙ}
2. Compute ensemble loss:
   L_ensemble = Σᵢ wᵢ · L(Mᵢ(x), y)
3. Optimize adversarial example against ensemble
4. Result transfers well to unseen models

Code Demo: Ensemble Transfer Attack

def ensemble_attack(models, images, labels, epsilon=0.3, alpha=0.01, num_iter=40, weights=None):
    """
    Generate adversarial examples using ensemble of models
    
    Args:
        models: List of models
        images: Input images
        labels: True labels
        epsilon: Perturbation budget
        alpha: Step size
        num_iter: Number of iterations
        weights: Ensemble weights (uniform if None)
    
    Returns:
        Adversarial examples optimized for ensemble
    """
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    
    images = images.clone().detach().to(device)
    labels = labels.clone().detach().to(device)
    
    # Initialize
    adv_images = images.clone()
    adv_images.requires_grad = True
    
    for i in range(num_iter):
        # Compute ensemble loss
        total_loss = 0
        
        for model, weight in zip(models, weights):
            outputs = model(adv_images)
            loss = F.cross_entropy(outputs, labels)
            total_loss += weight * loss
        
        # Backward pass
        for model in models:
            model.zero_grad()
        total_loss.backward()
        
        # Update
        grad = adv_images.grad.data
        adv_images = adv_images.detach() + alpha * grad.sign()
        
        # Project
        delta = torch.clamp(adv_images - images, -epsilon, epsilon)
        adv_images = images + delta
        adv_images = torch.clamp(adv_images, 0, 1)
        
        adv_images.requires_grad = True
    
    return adv_images.detach()

# Test ensemble attack
def test_ensemble_transfer():
    """
    Test transfer attack using ensemble
    """
    # Load multiple ImageNet-pretrained models for the ensemble
    # (test_loader must provide inputs/labels that match these models)
    models = [
        torchvision.models.resnet18(pretrained=True).to(device).eval(),
        torchvision.models.resnet34(pretrained=True).to(device).eval(),
        torchvision.models.vgg16(pretrained=True).to(device).eval(),
    ]
    
    # Victim model (not in ensemble)
    victim_model = torchvision.models.densenet121(pretrained=True).to(device).eval()
    
    test_images, test_labels = next(iter(test_loader))
    test_images, test_labels = test_images.to(device), test_labels.to(device)
    
    # Generate ensemble adversarial examples
    adv_images = ensemble_attack(models, test_images, test_labels)
    
    # Test on each ensemble model
    print("Transfer success on ensemble models:")
    for i, model in enumerate(models):
        with torch.no_grad():
            outputs = model(adv_images)
            preds = outputs.argmax(1)
            success = (preds != test_labels).float().mean()
            print(f"  Model {i+1}: {success:.2%}")
    
    # Test on victim model (KEY: this model wasn't in ensemble!)
    with torch.no_grad():
        outputs = victim_model(adv_images)
        preds = outputs.argmax(1)
        success = (preds != test_labels).float().mean()
        print(f"Victim Model (DenseNet): {success:.2%}")
    
    return adv_images

ensemble_adv = test_ensemble_transfer()

Typical Results:

  • Ensemble models: 90-95% attack success
  • Victim model: 60-80% attack success (impressive transfer!)

5.5 Practical Implications

For Attackers (Black-box Scenarios)

Attack Pipeline:

  1. Identify target system (e.g., face recognition API)
  2. Collect similar training data
  3. Train substitute models
  4. Generate ensemble adversarial examples
  5. Test on target system
  6. Success without ever seeing the target model!

Real Example:

  • Target: Google Cloud Vision API
  • Substitute: ImageNet-trained ResNets
  • Result: 70%+ transfer success rate

For Defenders

Security Implications:

  1. Security through obscurity doesn't work
    • Hiding model architecture provides little security
    • Attackers can use transfer attacks
  2. Need robust models, not hidden models
    • Adversarial training on diverse architectures
    • Ensemble defenses
    • Input preprocessing
  3. Detection opportunities
    • Transferred examples may be less optimized
    • Slightly larger perturbations
    • Potential for detection mechanisms

5.6 Experimental Activity

Student Exercise (15 minutes):

# TODO for students: Complete this function
def measure_transferability(source_model, target_models, attack_fn, test_loader):
    """
    Measure transfer rates between models
    
    Args:
        source_model: Model to generate adversarial examples on
        target_models: List of models to test transfer
        attack_fn: Function to generate adversarial examples
        test_loader: Test data
    
    Returns:
        Transfer matrix (success rates)
    """
    transfer_rates = []
    
    for images, labels in test_loader:
        # TODO: Generate adversarial examples on source_model
        # TODO: Test on each target_model
        # TODO: Calculate success rates
        pass
    
    return transfer_rates

# Test questions for students:
# 1. Which pairs of models have highest transfer?
# 2. Does attack strength (epsilon) affect transfer rate?
# 3. How does targeted vs untargeted affect transfer?

6. Physical Adversarial Examples

Duration: 20 minutes

6.1 The Challenge of Physical Attacks

Digital vs. Physical Attacks:

| Aspect               | Digital    | Physical                          |
|----------------------|------------|-----------------------------------|
| Perturbation Control | Exact      | Approximate                       |
| Environment          | Controlled | Variable                          |
| Transformations      | None       | Viewing angle, lighting, distance |
| Medium               | Pixels     | Physical objects                  |
| Persistence          | Temporary  | Permanent                         |

Why Physical Attacks Matter:

  • Real-world deployment scenarios (autonomous vehicles, security cameras)
  • Persistent threats (stickers, printed patterns)
  • Harder to detect and remove
  • Demonstrate practical security vulnerabilities

6.2 Physical World Challenges

Environmental Variations:

  1. Viewing Angle
    • Camera perspective changes
    • 3D to 2D projection
    • Occlusion and distortion
  2. Lighting Conditions
    • Shadows and highlights
    • Color shifts
    • Reflections and glare
  3. Distance
    • Resolution changes
    • Focus and blur
    • Scale variations
  4. Printing/Fabrication
    • Color gamut limitations
    • Material properties
    • Texture and finish

The Core Problem:

Digital perturbation → Physical medium → Camera capture → Model input
     (optimized)      (approximation)    (transformations)   (changed)

Adversarial example must survive this entire pipeline!

6.3 Expectation over Transformation (EOT)

Developed by: Athalye et al. (2018)
Key Insight: Optimize adversarial examples to be robust across transformations

Algorithm:

For each optimization iteration:
    1. Sample random transformation T ~ T_distribution
       (e.g., rotation, scaling, lighting change)
    2. Apply T to adversarial example
    3. Compute loss on transformed version
    4. Update perturbation based on expected loss

Mathematical Formulation:

maximize: E_{t~T}[ L(f(t(x + δ)), y_true) ]    subject to ||δ|| ≤ ε

Where:
- t ~ T: random transformation drawn from the distribution T (rotation, lighting, scale, etc.)
- E_{t~T}: expectation over that transformation distribution
- Maximizing the expected loss (untargeted case; for a targeted attack, minimize the expected loss toward the target class instead) makes the perturbation robust to the sampled transformations

Code Demo: EOT for Physical Robustness

import torchvision.transforms.functional as TF

def eot_attack(model, images, labels, epsilon=0.3, num_iter=100, num_samples=20):
    """
    Expectation over Transformation attack for physical robustness
    
    Args:
        model: Target model
        images: Input images
        labels: True labels
        epsilon: Perturbation budget
        num_iter: Number of optimization iterations
        num_samples: Number of transformations to sample per iteration
    
    Returns:
        Physically robust adversarial examples
    """
    images = images.clone().detach().to(device)
    labels = labels.clone().detach().to(device)
    
    # Initialize perturbation
    delta = torch.zeros_like(images)
    delta.requires_grad = True
    
    optimizer = torch.optim.Adam([delta], lr=0.01)
    
    for iteration in range(num_iter):
        optimizer.zero_grad()
        
        total_loss = 0
        
        # Sample multiple transformations
        for _ in range(num_samples):
            # Apply random transformations
            transformed = apply_random_transform(images + delta)
            
            # Untargeted attack: negate the classification loss so that
            # minimizing total_loss with Adam maximizes the true-label loss
            outputs = model(transformed)
            loss = -F.cross_entropy(outputs, labels)
            total_loss += loss
        
        # Average loss over transformations
        avg_loss = total_loss / num_samples
        
        # Backward pass
        avg_loss.backward()
        optimizer.step()
        
        # Project perturbation
        with torch.no_grad():
            delta.data = torch.clamp(delta.data, -epsilon, epsilon)
            delta.data = torch.clamp(images + delta.data, 0, 1) - images
    
    return (images + delta).detach()

def apply_random_transform(images):
    """
    Apply random transformations simulating physical world variations
    """
    batch_size = images.shape[0]
    transformed = []
    
    for i in range(batch_size):
        img = images[i]
        
        # Random rotation (-15 to +15 degrees)
        angle = torch.rand(1).item() * 30 - 15
        img = TF.rotate(img, angle)
        
        # Random brightness (0.8 to 1.2)
        brightness = torch.rand(1).item() * 0.4 + 0.8
        img = TF.adjust_brightness(img, brightness)
        
        # Random contrast (0.8 to 1.2)
        contrast = torch.rand(1).item() * 0.4 + 0.8
        img = TF.adjust_contrast(img, contrast)
        
        # Random scaling (0.9 to 1.1)
        scale = torch.rand(1).item() * 0.2 + 0.9
        h, w = img.shape[1], img.shape[2]
        new_h, new_w = int(h * scale), int(w * scale)
        img = TF.resize(img, (new_h, new_w))
        img = TF.center_crop(img, (h, w))
        
        transformed.append(img)
    
    return torch.stack(transformed)

# Test physical robustness
def test_physical_robustness():
    """
    Compare digital-only vs EOT attacks under transformations
    """
    test_images, test_labels = next(iter(test_loader))
    test_images, test_labels = test_images.to(device), test_labels.to(device)
    
    # Generate digital-only adversarial examples
    digital_adv = pgd_attack(model, test_images, test_labels)
    
    # Generate physically robust adversarial examples
    physical_adv = eot_attack(model, test_images, test_labels)
    
    # Test under various transformations
    num_tests = 50
    digital_success = []
    physical_success = []
    
    for _ in range(num_tests):
        # Apply random transformation
        digital_transformed = apply_random_transform(digital_adv)
        physical_transformed = apply_random_transform(physical_adv)
        
        with torch.no_grad():
            # Test digital adversarial
            outputs = model(digital_transformed)
            preds = outputs.argmax(1)
            digital_success.append((preds != test_labels).float().mean().item())
            
            # Test physical adversarial
            outputs = model(physical_transformed)
            preds = outputs.argmax(1)
            physical_success.append((preds != test_labels).float().mean().item())
    
    print(f"Digital-only attack success under transformations: {np.mean(digital_success):.2%}")
    print(f"EOT attack success under transformations: {np.mean(physical_success):.2%}")

test_physical_robustness()

6.4 Case Studies: Real-World Physical Attacks

Case Study 1: Adversarial Stop Signs

Research: Eykholt et al. (2018) - "Robust Physical-World Attacks on Deep Learning Visual Classification"

Attack Scenario:

  • Target: Traffic sign recognition in autonomous vehicles
  • Method: Black and white stickers on stop signs
  • Goal: Misclassify as speed limit or other signs

Results:

  • Success rate: 80%+ in physical world
  • Worked under various lighting and angles
  • Only needed to modify ~20% of sign area
  • Demonstrated serious autonomous vehicle vulnerability

Attack Process:

  1. Print adversarial patterns on stickers
  2. Place on stop sign in specific locations
  3. Attack survives camera capture and processing
  4. Model misclassifies sign

Defenses:

  • Multi-view verification
  • Temporal consistency (video frames)
  • Anomaly detection on sign appearance
  • Redundant sensing modalities

Case Study 2: Adversarial Eyeglasses

Research: Sharif et al. (2016) - "Accessorize to a Crime: Real and Stealthy Attacks on State-of-the-Art Face Recognition"

Attack Scenario:

  • Target: Face recognition systems
  • Method: Specially designed eyeglass frames
  • Goals:
    • Dodging: Avoid detection
    • Impersonation: Be recognized as someone else

Results:

  • Impersonation success: 100% in some cases
  • Dodging success: High
  • Physically realizable (can be fabricated)
  • Inconspicuous (looks like normal glasses)

Technical Approach:

  1. Optimize eyeglass frame pattern using EOT
  2. Account for different facial expressions and poses
  3. Print on actual glasses
  4. Test on commercial face recognition systems

Case Study 3: Adversarial Patches

Research: Brown et al. (2017) - "Adversarial Patch"

Attack Concept:

  • Small localized patch (can place anywhere in scene)
  • Causes misclassification when captured by camera
  • Independent of object location
  • Universal (one patch works for many images)

Example Applications:

def adversarial_patch_attack(model, patch_size=100, num_iter=1000):
    """
    Generate universal adversarial patch
    
    Args:
        model: Target model
        patch_size: Size of square patch
        num_iter: Optimization iterations
    
    Returns:
        Adversarial patch that can be applied anywhere
    """
    # Initialize random patch
    patch = torch.rand(3, patch_size, patch_size).to(device)
    patch.requires_grad = True
    
    optimizer = torch.optim.Adam([patch], lr=0.01)
    
    for iteration in range(num_iter):
        # Sample a batch of images (assumes a train_loader with shuffle=True,
        # so re-creating the iterator yields a different random batch each time)
        images, labels = next(iter(train_loader))
        images = images.to(device)
        
        # Apply patch at random locations
        patched_images = apply_patch_random_location(images, patch)
        
        # Optimize for targeted misclassification
        # (e.g., make everything classified as "toaster")
        target_class = 859  # toaster in ImageNet
        targets = torch.full_like(labels, target_class).to(device)
        
        outputs = model(patched_images)
        loss = F.cross_entropy(outputs, targets)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Clip patch to valid range
        with torch.no_grad():
            patch.data = torch.clamp(patch.data, 0, 1)
        
        if iteration % 100 == 0:
            acc = (outputs.argmax(1) == targets).float().mean()
            print(f"Iter {iteration}: Target accuracy = {acc:.2%}")
    
    return patch.detach()

def apply_patch_random_location(images, patch):
    """
    Apply patch at random location in each image
    """
    batch_size, c, h, w = images.shape
    p_h, p_w = patch.shape[1], patch.shape[2]
    
    patched = images.clone()
    
    for i in range(batch_size):
        # Random location
        x = torch.randint(0, w - p_w, (1,)).item()
        y = torch.randint(0, h - p_h, (1,)).item()
        
        # Apply patch
        patched[i, :, y:y+p_h, x:x+p_w] = patch
    
    return patched

Real-World Implications:

  • Attacker can print and place patch
  • Works regardless of scene composition
  • Very practical threat
  • Hard to defend (patch can be anywhere)

6.5 Defenses Against Physical Attacks

Challenges:

  • Physical attacks are harder to defend against
  • Transformations make detection difficult
  • Adversaries can iterate in physical world

Defense Strategies:

  1. Input Preprocessing
    • JPEG compression
    • Total variation minimization
    • Randomized smoothing
    • May reduce attack effectiveness (see the sketch after this list)
  2. Adversarial Training
    • Train on EOT-generated examples
    • Improves robustness to transformations
    • Computationally expensive
  3. Multi-Modal Sensing
    • Combine camera with lidar, radar
    • Harder to fool all modalities simultaneously
    • Common in autonomous vehicles
  4. Temporal Consistency
    • Check predictions across video frames
    • Physical objects should be consistent
    • Detect anomalous frame-to-frame changes
  5. Anomaly Detection
    • Detect unusual patterns (stickers, patches)
    • Shape and texture analysis
    • Machine learning for anomaly detection
  6. Certified Defenses
    • Randomized smoothing with provable guarantees
    • Can certify robustness to bounded perturbations
    • Active research area
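
As a concrete example of the input-preprocessing idea (item 1 above), a minimal sketch of JPEG re-encoding as a preprocessing step; the quality setting is illustrative, and this kind of defense weakens naive attacks but does not stop strong adaptive ones.

import io
import torch
from PIL import Image
from torchvision.transforms.functional import to_pil_image, to_tensor

def jpeg_preprocess(images, quality=75):
    """Re-encode each image as JPEG before classification to destroy high-frequency perturbations."""
    out = []
    for img in images.cpu():
        buffer = io.BytesIO()
        to_pil_image(img.clamp(0, 1)).save(buffer, format="JPEG", quality=quality)
        buffer.seek(0)
        out.append(to_tensor(Image.open(buffer)))
    return torch.stack(out).to(images.device)

# Usage: logits = model(jpeg_preprocess(adv_images))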

7. Wrap-up & Discussion

Duration: 10 minutes

7.1 Key Takeaways

What We Learned:

  1. Adversarial Examples are Real
    • Neural networks are fundamentally vulnerable
    • Small perturbations cause dramatic failures
    • Both theoretical and practical threat
  2. Attack Taxonomy
    • White-box (FGSM, PGD, C&W) for strongest attacks
    • Black-box (transfer, query-based) for realistic scenarios
    • Physical attacks for real-world deployment
  3. Transferability is Powerful
    • Adversarial examples transfer across models
    • Enables black-box attacks
    • Security through obscurity fails
  4. Physical Attacks are Practical
    • Real-world demonstrations exist
    • EOT makes attacks robust to transformations
    • Serious implications for deployed systems

7.2 Critical Thinking Questions

Discussion Topics:

  1. Fundamental Question:
    • Are adversarial examples a bug or a feature of machine learning?
    • Can we ever fully eliminate them?
  2. Ethical Considerations:
    • Should researchers publish adversarial attack methods?
    • How to balance security research with potential misuse?
  3. Real-World Deployment:
    • What systems are most at risk?
    • How should organizations respond?
  4. Defense vs. Attack:
    • Is this an arms race with no end?
    • What's the path forward?

7.3 Looking Ahead

Next Week: Data Poisoning & Backdoor Attacks

  • Training-time attacks
  • How attackers can compromise models before deployment
  • Trojan behaviors in neural networks

Connections:

  • Today: Test-time evasion attacks
  • Next week: Training-time poisoning attacks
  • Together: Complete picture of adversarial ML threats

7.4 Assignment Preview

Homework 3: Implementing Adversarial Attacks

Due: Date

Tasks:

  1. Implement FGSM and PGD on CIFAR-10
  2. Evaluate transferability between architectures
  3. Experiment with EOT for robustness
  4. Written report on findings

Rubric:

  • Implementation correctness (40%)
  • Experimental methodology (30%)
  • Analysis and insights (20%)
  • Code quality and documentation (10%)

Starter Code: Will be posted on Canvas

7.5 Resources for Further Study

Seminal Papers:

  1. Szegedy et al. (2013) - "Intriguing properties of neural networks"
  2. Goodfellow et al. (2014) - "Explaining and Harnessing Adversarial Examples"
  3. Madry et al. (2017) - "Towards Deep Learning Models Resistant to Adversarial Attacks"
  4. Carlini & Wagner (2017) - "Towards Evaluating the Robustness of Neural Networks"

Tutorials and Surveys:

  • Adversarial Robustness Toolbox (ART) - IBM
  • CleverHans Library - Google Brain
  • Adversarial ML Reading List - Nicholas Carlini

Online Resources:

  • OpenAI Blog on Adversarial Examples
  • Google AI Blog - Security & Privacy
  • NIST Adversarial ML Framework

Appendix: Code Repositories

Complete Implementation: All code from today's demos is available in the course repository:

/course-materials/week3-adversarial-attacks/
├── fgsm.py
├── pgd.py
├── cw_attack.py
├── eot_physical.py
├── transfer_experiments.py
└── visualization_utils.py

Dependencies:

pip install torch torchvision matplotlib numpy

Quick Start:

git clone [course-repo-url]
cd week3-adversarial-attacks
python fgsm.py --epsilon 0.3 --model resnet18

Questions?

Office Hours: Tuesday/Thursday, 1:00-3:30 PM (Zoom)
Email: zhengxiong.li@ucdenver.edu
Discussion Forum: Canvas

Remember: The best way to understand adversarial attacks is to implement them yourself!


End of Week 3 Tutorial