Week 3: Evasion Attacks & Adversarial Examples

CSCI 5773 - Introduction to Emerging Systems Security

Duration: 140-150 minutes
Module: Adversarial Machine Learning
Instructor: Dr. Zhengxiong Li


Table of Contents

  1. Introduction & Motivation (15 min)
  2. Fundamentals of Adversarial Machine Learning (20 min)
  3. White-box vs. Black-box Attacks (15 min)
  4. Attack Methods: FGSM, PGD, C&W (50 min)
  5. Transferability of Adversarial Examples (20 min)
  6. Physical Adversarial Examples (20 min)
  7. Wrap-up & Discussion (10 min)

1. Introduction & Motivation

Duration: 15 minutes

1.1 The Adversarial Machine Learning Problem

Opening Question: What happens when a machine learning model sees something it wasn't designed to handle?

Imagine a self-driving car's vision system that classifies a stop sign as a speed limit sign because someone placed a few carefully designed stickers on it. Or a spam filter that fails to detect malicious emails because attackers subtly modified their wording. These are examples of adversarial attacks on machine learning systems.

1.2 Historical Context

The Szegedy et al. Discovery (2013)

  • Researchers at Google discovered that adding imperceptible perturbations to images could fool neural networks
  • The now-famous illustration (from the follow-up paper by Goodfellow et al., 2014): a panda image + carefully crafted noise = classified as "gibbon" with 99.3% confidence
  • The perturbations are so small that humans cannot detect them
  • Key insight: Neural networks are vulnerable to inputs specifically designed to exploit their decision boundaries

1.3 Why This Matters

Real-World Security Implications:

  1. Autonomous Vehicles
    • Misclassifying traffic signs, pedestrians, or obstacles
    • Potential for accidents or unauthorized access
  2. Biometric Authentication
    • Fooling face recognition systems
    • Bypassing fingerprint or iris scanners
  3. Malware Detection
    • Evading antivirus and intrusion detection systems
    • Polymorphic malware that adapts to avoid detection
  4. Content Moderation
    • Bypassing filters for hate speech, misinformation, or illegal content
    • Automated censorship evasion
  5. Medical Diagnosis
    • Misclassifying medical images (X-rays, MRIs)
    • Potential for incorrect diagnoses and treatments

1.4 Learning Objectives for Today

By the end of this session, you will be able to:

  • ✅ Understand the fundamental concepts of adversarial machine learning
  • ✅ Distinguish between white-box and black-box attack scenarios
  • ✅ Implement basic adversarial attacks (FGSM, PGD, C&W)
  • ✅ Evaluate the transferability of adversarial examples across models
  • ✅ Recognize the challenges of physical-world adversarial attacks
  • ✅ Assess model robustness against adversarial inputs

2. Fundamentals of Adversarial Machine Learning

Duration: 20 minutes

2.1 What is an Adversarial Example?

Definition: An adversarial example is a specially crafted input designed to cause a machine learning model to make a mistake, typically by adding small, carefully calculated perturbations to legitimate inputs.

Mathematical Formulation:

Given:

  • A classifier function: f(x) = y
  • An original input: x with true label y_true
  • A perturbation: δ (delta)

An adversarial example is: x_adv = x + δ

Where:

  • f(x_adv) ≠ y_true (misclassification occurs)
  • ||δ|| < ε (perturbation is small, typically constrained by epsilon)
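
A minimal sketch of this definition in code (the names model, x, x_adv, and y_true are placeholders, not part of the course codebase), checking both conditions under an L∞ budget:

import torch

def is_adversarial(model, x, x_adv, y_true, epsilon):
    """Check the two defining conditions: misclassification and a small perturbation."""
    with torch.no_grad():
        pred = model(x_adv.unsqueeze(0)).argmax(dim=1).item()
    delta_inf = (x_adv - x).abs().max().item()     # ||delta||_inf
    return (pred != y_true) and (delta_inf <= epsilon)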

2.2 Types of Adversarial Attacks

By Attack Goal:

  1. Untargeted Attacks
    • Goal: Cause any misclassification
    • Example: Make a "cat" image classified as anything except "cat"
    • Easier to achieve
  2. Targeted Attacks
    • Goal: Cause misclassification to a specific target class
    • Example: Make a "cat" image classified specifically as "dog"
    • More challenging, requires more sophisticated perturbations

By Perturbation Constraints:

  1. L∞ (L-infinity) Norm
    • Limits maximum change to any single pixel
    • ||δ||∞ ≤ ε means no pixel changes by more than ε
    • Most commonly used in research
    • Example: ε = 0.03 on a [0, 1] pixel scale
  2. L2 (Euclidean) Norm
    • Limits total perturbation energy
    • ||δ||₂ ≤ ε constrains the Euclidean distance
    • Better represents overall image distortion
  3. L0 Norm
    • Limits number of pixels that can be changed
    • Sparse perturbations
    • Example: Modify only 10% of pixels
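
Each of these norms can be computed directly from the perturbation tensor. A minimal sketch (the perturbation values here are random and purely illustrative):

import torch

delta = torch.randn(3, 32, 32) * 0.01      # illustrative perturbation for a 32x32 RGB image

linf = delta.abs().max()                    # L-infinity: largest change to any single value
l2 = delta.flatten().norm(p=2)              # L2: Euclidean length of the perturbation
l0 = (delta != 0).sum()                     # L0: number of modified values

print(f"L-inf = {linf:.4f}, L2 = {l2:.4f}, L0 = {l0}")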

2.3 The Threat Model

Understanding adversarial attacks requires defining the threat model - what the attacker knows and can do.

Key Dimensions:

  1. Knowledge of the Model
    • White-box: Full access to model architecture, weights, and training data
    • Black-box: Only query access (input/output)
    • Gray-box: Partial knowledge (e.g., architecture but not weights)
  2. Attack Capability
    • Test-time attacks: Modify inputs at inference
    • Training-time attacks: Poison training data (covered in Week 4)
  3. Physical Constraints
    • Digital attacks: Direct pixel manipulation
    • Physical attacks: Real-world modifications (stickers, lighting, etc.)

2.4 Why Are Neural Networks Vulnerable?

Key Factors:

  1. High Dimensionality
    • Images have thousands/millions of dimensions
    • Small changes in many dimensions accumulate
  2. Linear Nature
    • Despite non-linear activations, many models are locally linear
    • Small perturbations propagate linearly through layers (see the short calculation after this list)
  3. Overconfidence
    • Models make confident predictions even far from training distribution
    • No built-in uncertainty quantification
  4. Decision Boundary Proximity
    • Natural images often lie close to decision boundaries
    • Easy to push them across with small perturbations
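
A one-line version of the linearity argument (factors 1 and 2 above), following Goodfellow et al. (2014): for a linear score w · x, the FGSM-style perturbation δ = ε · sign(w) shifts the score by

w · x_adv = w · (x + δ) = w · x + ε · ||w||₁

With n input dimensions and average weight magnitude m, the shift is roughly ε · m · n, so an ε too small to see can still move the score by a large amount simply because n is large.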

Visual Analogy:

Normal Image Space:
  [Cat Region] | [Dog Region] | [Car Region]
       x       |              |
      (cat)    |   boundary   |

Adversarial Example:
  [Cat Region] | [Dog Region] | [Car Region]
       x --δ-->|  x_adv       |
      (cat)    | (dog)        |

The perturbation δ pushes the input across the decision boundary.


3. White-box vs. Black-box Attacks

Duration: 15 minutes

3.1 White-box Attacks

Definition: The attacker has complete knowledge of the target model.

Attacker's Knowledge:

  • ✅ Model architecture (layers, activations, etc.)
  • ✅ Model parameters (weights and biases)
  • ✅ Training procedure and hyperparameters
  • ✅ Training data distribution (sometimes)
  • ✅ Gradient information

Advantages:

  • Can compute exact gradients
  • Most effective attacks possible
  • Theoretical worst-case scenario
  • Useful for robustness testing

Attack Strategy:

  • Use gradient-based optimization
  • Leverage backpropagation
  • Direct computation of optimal perturbations

Example Scenario:

Researcher testing their own model for vulnerabilities
└─> Full access to model internals
    └─> Can compute: ∇_x L(f(x), y_target)

3.2 Black-box Attacks

Definition: The attacker has no knowledge of model internals, only query access.

Attacker's Knowledge:

  • ✅ Input format
  • ✅ Output format (labels, probabilities)
  • ❌ Model architecture
  • ❌ Model parameters
  • ❌ Gradient information

Attack Approaches:

  1. Query-based Attacks
    • Submit inputs and observe outputs
    • Estimate gradients through finite differences
    • Requires many queries (can be expensive/detectable)
  2. Transfer-based Attacks
    • Train a substitute model on similar data
    • Generate adversarial examples on substitute
    • Transfer them to target model
    • Exploits transferability property
  3. Decision-based Attacks
    • Only observe final decisions (no probabilities)
    • Use boundary-following techniques
    • Requires even more queries

Example Scenario:

Attacker targeting a cloud ML API
└─> Can only send images and receive predictions
    └─> Must estimate gradients or use transferability
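
To make the query-based approach concrete, here is a minimal sketch of zeroth-order (finite-difference) gradient estimation. The query_loss function stands in for the victim API and is a placeholder; practical attacks estimate only a few random directions per step, because two queries per coordinate is prohibitively expensive for images.

import torch

def estimate_gradient(query_loss, x, sigma=1e-3):
    """
    Estimate d(loss)/dx using only forward queries to a black-box model.

    query_loss(x) -> scalar loss value returned by the victim (placeholder).
    Costs two queries per input coordinate.
    """
    flat = x.flatten()
    grad = torch.zeros_like(flat)
    for i in range(flat.numel()):
        e = torch.zeros_like(flat)
        e[i] = sigma
        loss_plus = query_loss((flat + e).view_as(x))
        loss_minus = query_loss((flat - e).view_as(x))
        grad[i] = (loss_plus - loss_minus) / (2 * sigma)
    return grad.view_as(x)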

3.3 Comparison Table

| Aspect              | White-box                       | Black-box              |
|---------------------|---------------------------------|------------------------|
| Model Access        | Complete                        | Query only             |
| Gradient Info       | Available                       | Must estimate          |
| Attack Success Rate | Highest                         | Lower                  |
| Queries Required    | Few                             | Many                   |
| Realism             | Lower (rarely have full access) | Higher                 |
| Detection Risk      | Lower                           | Higher (many queries)  |
| Computation         | Efficient                       | Expensive              |

3.4 Gray-box Attacks (Intermediate)

Definition: Partial knowledge of the model.

Common Scenarios:

  • Know architecture but not weights (e.g., public model types)
  • Have access to similar models from same vendor
  • Know training methodology but not exact data

Strategy:

  • Use architecture knowledge to build substitute
  • Fine-tune on available data
  • Apply transfer attacks

4. Attack Methods: FGSM, PGD, C&W

Duration: 50 minutes

4.1 Fast Gradient Sign Method (FGSM)

Duration: 15 minutes

Developed by: Ian Goodfellow et al. (2014)
Type: White-box, single-step attack
Complexity: Low (fast and simple)

The Core Idea

FGSM linearizes the loss function around the current input and takes a single step in the direction that maximizes loss.

Mathematical Formulation:

x_adv = x + ε · sign(∇_x L(f(x), y_true))

Where:

  • x: Original input
  • ε: Perturbation magnitude (typically 0.01 to 0.3)
  • ∇_x L(f(x), y_true): Gradient of loss w.r.t. input
  • sign(): Takes the sign of each gradient component (+1, 0, or -1)
  • y_true: True label (for untargeted) or target label (for targeted)

Intuition

Think of the loss function as a hill:

  • For untargeted attacks: Climb the hill (increase loss) → misclassification
  • For targeted attacks: Go downhill toward target class

The sign() function ensures we move the same distance (ε) in each dimension, regardless of gradient magnitude.
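
A tiny numeric illustration of that point (the gradient values are made up): components with very different magnitudes all end up moving by exactly ε.

import torch

grad = torch.tensor([0.8, -0.002, 0.0])    # hypothetical gradient components
epsilon = 0.03

step = epsilon * grad.sign()                # every nonzero component moves by exactly epsilon
print(step)                                 # tensor([ 0.0300, -0.0300,  0.0000])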

Step-by-Step Process

  1. Forward pass: Compute model prediction on original input
  2. Compute loss: Calculate loss w.r.t. true/target label
  3. Backward pass: Compute gradient of loss w.r.t. input
  4. Generate perturbation: Take sign of gradient, multiply by ε
  5. Create adversarial example: Add perturbation to original input
  6. Clip values: Ensure pixels remain in the valid range [0, 1]

Code Demo: FGSM Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np

# Load a pre-trained ResNet-18 (torchvision's weights are trained on ImageNet;
# for CIFAR-10 you would fine-tune or load CIFAR-trained weights instead)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torchvision.models.resnet18(pretrained=True)
model = model.to(device)
model.eval()

def fgsm_attack(image, epsilon, data_grad):
    """
    FGSM attack implementation
    
    Args:
        image: Original input image (tensor)
        epsilon: Perturbation magnitude
        data_grad: Gradient of loss w.r.t. input
    
    Returns:
        Perturbed image
    """
    # Collect the sign of the data gradient
    sign_data_grad = data_grad.sign()
    
    # Create the perturbed image
    perturbed_image = image + epsilon * sign_data_grad
    
    # Clip to maintain valid pixel range [0, 1]
    perturbed_image = torch.clamp(perturbed_image, 0, 1)
    
    return perturbed_image

def test_fgsm(model, device, test_loader, epsilon):
    """
    Test FGSM attack on a dataset
    """
    correct = 0
    adv_examples = []
    
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        
        # Set requires_grad to true for gradient computation
        data.requires_grad = True
        
        # Forward pass
        output = model(data)
        init_pred = output.max(1, keepdim=True)[1]
        
        # Skip if initially incorrect
        if init_pred.item() != target.item():
            continue
        
        # Calculate loss
        loss = F.cross_entropy(output, target)
        
        # Zero gradients
        model.zero_grad()
        
        # Backward pass
        loss.backward()
        
        # Collect gradient
        data_grad = data.grad.data
        
        # Generate adversarial example
        perturbed_data = fgsm_attack(data, epsilon, data_grad)
        
        # Re-classify
        output = model(perturbed_data)
        final_pred = output.max(1, keepdim=True)[1]
        
        if final_pred.item() == target.item():
            correct += 1
        else:
            # Save adversarial example for visualization
            if len(adv_examples) < 5:
                adv_ex = perturbed_data.squeeze().detach().cpu().numpy()
                adv_examples.append((init_pred.item(), final_pred.item(), adv_ex))
    
    # Calculate accuracy
    accuracy = correct / len(test_loader)
    print(f"Epsilon: {epsilon}\tAccuracy: {accuracy:.2f}")
    
    return accuracy, adv_examples

# Example usage (assumes a test_loader with batch_size=1 has been defined)
epsilons = [0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3]
accuracies = []

for eps in epsilons:
    acc, ex = test_fgsm(model, device, test_loader, eps)
    accuracies.append(acc)

# Plot accuracy vs epsilon
plt.figure(figsize=(8, 6))
plt.plot(epsilons, accuracies, "*-")
plt.xlabel("Epsilon")
plt.ylabel("Accuracy")
plt.title("Model Accuracy vs. FGSM Epsilon")
plt.show()

Strengths and Limitations

Strengths:

  • ✅ Extremely fast (single gradient computation)
  • ✅ Simple to implement
  • ✅ Good for testing baseline robustness
  • ✅ Works well for small ε values

Limitations:

  • ❌ Single-step method (not optimal)
  • ❌ Lower success rate than iterative methods
  • ❌ Can be easily defended against
  • ❌ Limited transferability

Targeted FGSM Variant

For targeted attacks (forcing classification to specific target):

def fgsm_targeted(image, epsilon, data_grad):
    """
    Targeted FGSM - minimize loss for target class
    """
    # Note the negative sign (gradient descent instead of ascent)
    sign_data_grad = data_grad.sign()
    perturbed_image = image - epsilon * sign_data_grad
    perturbed_image = torch.clamp(perturbed_image, 0, 1)
    return perturbed_image

4.2 Projected Gradient Descent (PGD)

Duration: 20 minutes

Developed by: Madry et al. (2017)
Type: White-box, iterative attack
Complexity: Medium (multiple iterations)

The Core Idea

PGD is an iterative version of FGSM that takes multiple smaller steps and projects back onto the allowed perturbation space. It's considered one of the strongest first-order adversarial attacks.

Why Iterative?

  • Single-step FGSM is suboptimal
  • Multiple small steps find better adversarial examples
  • Can escape local minima
  • Achieves higher attack success rates

Mathematical Formulation:

x₀ = x + uniform_noise(-ε, ε)  # Random initialization

For t = 0 to T-1:
    x_{t+1} = Π_{x+S}(x_t + α · sign(∇_x L(f(x_t), y)))

Where:

  • T: Number of iterations (typically 10-100)
  • α: Step size (typically ε/T or smaller)
  • Π_{x+S}: Projection operator that clips to allowed space
  • S: Allowed perturbation region (L∞ ball of radius ε)
  • Random initialization helps escape local minima

Key Components

  1. Random Start
    • Initialize within ε-ball around original input
    • Helps find stronger adversarial examples
    • Prevents getting stuck in local optima
  2. Iterative Updates
    • Take multiple gradient steps
    • Each step smaller than FGSM (α << ε)
    • More thorough exploration of loss landscape
  3. Projection
    • After each step, project back to allowed region
    • Ensures ||x_adv - x||∞ ≤ ε
    • Maintains perturbation constraint

Projection Operator Explained

The projection ensures we stay within the ε-ball:

def project(x, x_orig, epsilon):
    """
    Project x onto L-infinity ball around x_orig with radius epsilon
    """
    # Clip perturbation to [-epsilon, epsilon]
    delta = torch.clamp(x - x_orig, -epsilon, epsilon)
    
    # Add back to original
    x_proj = x_orig + delta
    
    # Clip to valid pixel range [0, 1]
    x_proj = torch.clamp(x_proj, 0, 1)
    
    return x_proj

Code Demo: PGD Implementation

def pgd_attack(model, images, labels, epsilon=0.3, alpha=0.01, num_iter=40, random_start=True):
    """
    Projected Gradient Descent attack
    
    Args:
        model: Target model
        images: Input images
        labels: True labels (for untargeted) or target labels (for targeted)
        epsilon: Maximum perturbation (L-infinity norm)
        alpha: Step size per iteration
        num_iter: Number of iterations
        random_start: Whether to use random initialization
    
    Returns:
        Adversarial examples
    """
    images = images.clone().detach().to(device)
    labels = labels.clone().detach().to(device)
    
    # Random initialization within epsilon ball
    if random_start:
        delta = torch.empty_like(images).uniform_(-epsilon, epsilon)
        delta = torch.clamp(delta, 0-images, 1-images)  # Keep in valid range
        adv_images = images + delta
    else:
        adv_images = images.clone()
    
    adv_images.requires_grad = True
    
    for i in range(num_iter):
        # Forward pass
        outputs = model(adv_images)
        
        # Calculate loss
        loss = F.cross_entropy(outputs, labels)
        
        # Backward pass
        model.zero_grad()
        loss.backward()
        
        # Get gradient
        grad = adv_images.grad.data
        
        # Update adversarial images
        adv_images = adv_images.detach() + alpha * grad.sign()
        
        # Project back to epsilon ball
        delta = torch.clamp(adv_images - images, -epsilon, epsilon)
        adv_images = images + delta
        
        # Clip to valid pixel range
        adv_images = torch.clamp(adv_images, 0, 1)
        
        adv_images.requires_grad = True
    
    return adv_images.detach()

# Example usage with varying parameters
def test_pgd_variants():
    """
    Compare PGD with different parameters
    """
    test_image, test_label = next(iter(test_loader))
    test_image, test_label = test_image.to(device), test_label.to(device)
    
    configs = [
        {'num_iter': 10, 'alpha': 0.03, 'name': 'PGD-10'},
        {'num_iter': 40, 'alpha': 0.01, 'name': 'PGD-40'},
        {'num_iter': 100, 'alpha': 0.003, 'name': 'PGD-100'},
    ]
    
    epsilon = 0.3
    
    for config in configs:
        adv_images = pgd_attack(
            model, test_image, test_label,
            epsilon=epsilon,
            alpha=config['alpha'],
            num_iter=config['num_iter']
        )
        
        # Evaluate
        with torch.no_grad():
            orig_output = model(test_image)
            adv_output = model(adv_images)
            
            orig_pred = orig_output.argmax(1)
            adv_pred = adv_output.argmax(1)
            
            success = (orig_pred != adv_pred).sum().item()
            
        print(f"{config['name']}: Success Rate = {success}/{len(test_label)}")
        
        # Visualize perturbation
        perturbation = (adv_images - test_image).abs()
        print(f"  L∞ norm: {perturbation.max():.4f}")
        print(f"  L2 norm: {perturbation.norm():.4f}")

test_pgd_variants()

PGD Variants

  1. PGD-∞ (L-infinity constrained)
    • What we've described above
    • Most common in research
    • Constrains maximum per-pixel change
  2. PGD-2 (L2 constrained)
    • Constrains total Euclidean distance
    • Different projection operator
    def project_l2(x, x_orig, epsilon):
        delta = x - x_orig
        delta_norm = delta.norm(p=2)
        if delta_norm > epsilon:
            delta = delta * (epsilon / delta_norm)
        return x_orig + delta
    
  3. PGD with Momentum
    • Accumulates gradient history
    • Helps escape local minima
    • Similar to momentum in optimization
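
A minimal sketch of the momentum variant (in the style of MI-FGSM, Dong et al. 2018), following the same conventions as the pgd_attack function above; the decay factor mu = 1.0 is a common choice, not a fixed standard.

import torch
import torch.nn.functional as F

def pgd_momentum_attack(model, images, labels, epsilon=0.3, alpha=0.01, num_iter=40, mu=1.0):
    """PGD that accumulates L1-normalized gradients into a momentum buffer before taking the sign step."""
    images = images.clone().detach()
    labels = labels.clone().detach()
    
    adv_images = images.clone()
    momentum = torch.zeros_like(images)
    
    for _ in range(num_iter):
        adv_images.requires_grad = True
        loss = F.cross_entropy(model(adv_images), labels)
        grad = torch.autograd.grad(loss, adv_images)[0]
        
        # Normalize by the per-image L1 norm, then accumulate into the momentum buffer
        grad = grad / grad.abs().sum(dim=[1, 2, 3], keepdim=True).clamp(min=1e-12)
        momentum = mu * momentum + grad
        
        # Step in the sign direction of the accumulated momentum, then project
        adv_images = adv_images.detach() + alpha * momentum.sign()
        delta = torch.clamp(adv_images - images, -epsilon, epsilon)
        adv_images = torch.clamp(images + delta, 0, 1)
    
    return adv_images.detach()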

PGD as Universal Attack Standard

PGD is often considered the gold standard for evaluating adversarial robustness:

  • Strong enough to find most vulnerabilities
  • Computationally tractable
  • Well-understood theoretically
  • Used in adversarial training (defense method)

Trade-offs:

| Aspect                  | FGSM               | PGD                    |
|-------------------------|--------------------|------------------------|
| Speed                   | Very fast (1 iter) | Slower (40-100 iters)  |
| Success Rate            | Moderate           | High                   |
| Perturbation Efficiency | Lower              | Higher                 |
| Use Case                | Quick testing      | Thorough evaluation    |

4.3 Carlini & Wagner (C&W) Attack

Duration: 15 minutes

Developed by: Nicholas Carlini and David Wagner (2017)
Type: White-box, optimization-based attack
Complexity: High (but very effective)

The Core Idea

C&W reformulates adversarial example generation as an optimization problem with a carefully designed loss function. Instead of following the gradient of the standard classification loss, C&W optimizes a custom objective that balances two goals:

  1. Misclassification (attack success)
  2. Perturbation minimization (imperceptibility)

This produces minimal perturbations that reliably fool the model.

Mathematical Formulation

Optimization Problem:

minimize: ||δ||_p + c · f(x + δ)

subject to: x + δ ∈ [0, 1]^n

Where:

  • δ: Perturbation to be optimized
  • ||δ||_p: Perturbation magnitude (L0, L2, or L∞)
  • c: Confidence parameter (balances two objectives)
  • f(): Objective function measuring attack success

The f() Function:

Instead of directly using classification loss, C&W introduces:

f(x') = max(max{Z(x')_i : i ≠ t} - Z(x')_t, -κ)

Where:

  • Z(x'): Logits (pre-softmax outputs) for input x'
  • t: Target class
  • κ (kappa): Confidence parameter
  • max{Z(x')_i : i ≠ t}: Highest logit for non-target classes

Intuition:

  • When Z(x')_t exceeds every other logit, the attack succeeds; the clamp at -κ stops the optimization once the target wins by a margin of κ
  • κ controls the confidence margin (how strongly we want the target class to win)
  • c balances attack success against perturbation size
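
A tiny numeric check of f() with made-up logits for a 4-class model and target class t = 2:

import torch

Z = torch.tensor([2.0, 0.5, 3.1, 1.0])    # hypothetical logits Z(x')
t, kappa = 2, 0.0

max_other = Z[torch.arange(len(Z)) != t].max()      # highest non-target logit: 2.0
f_val = torch.clamp(max_other - Z[t], min=-kappa)   # max(max_other - Z_t, -kappa)
print(f_val)                                        # tensor(0.): the target already wins, so f contributes no loss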

Why C&W is Different

Advantages over FGSM/PGD:

  1. Minimal Perturbations
    • Finds smallest possible perturbation
    • More realistic threat model
    • Harder to detect
  2. High Success Rate
    • Near 100% success on many models
    • Works even against defensive distillation
    • Very hard to defend against
  3. Confidence Control
    • Can specify confidence of misclassification
    • κ parameter ensures robust adversarial examples
  4. Different Norms
    • L0: Minimizes number of changed pixels (sparse)
    • L2: Minimizes Euclidean distance (most common)
    • L∞: Minimizes maximum per-pixel change

Disadvantages:

  1. Computational Cost
    • Much slower than FGSM/PGD
    • Requires solving optimization problem per example
    • Can take seconds to minutes per image
  2. Complexity
    • More parameters to tune (c, κ, learning rate)
    • Requires careful initialization
    • Binary search for optimal c

Change of Variables Trick

To enforce x + δ ∈ [0, 1], C&W uses a clever change of variables:

x_adv = 0.5(tanh(w) + 1)

Where w is the optimization variable. This ensures:

  • tanh(w) ∈ [-1, 1]
  • x_adv ∈ [0, 1] automatically
  • No need for explicit clipping

Then:

δ = x_adv - x = 0.5(tanh(w) + 1) - x

Code Demo: C&W L2 Attack

def cw_l2_attack(model, images, labels, targeted=True, c=1, kappa=0, max_iter=1000, learning_rate=0.01):
    """
    Carlini & Wagner L2 attack
    
    Args:
        model: Target model
        images: Input images
        labels: Target labels (for targeted attack)
        c: Weight of attack loss
        kappa: Confidence parameter
        max_iter: Maximum optimization iterations
        learning_rate: Learning rate for optimizer
    
    Returns:
        Adversarial examples
    """
    images = images.clone().detach().to(device)
    labels = labels.clone().detach().to(device)
    
    # Initialize w (will be optimized)
    # tanh^-1(2*x - 1) maps [0,1] to real numbers
    w = torch.arctanh((2 * images - 1) * 0.999)  # 0.999 to avoid infinity
    w.requires_grad = True
    
    # Optimizer
    optimizer = torch.optim.Adam([w], lr=learning_rate)
    
    best_adv = images.clone()
    best_l2 = float('inf') * torch.ones(images.shape[0]).to(device)
    
    for iteration in range(max_iter):
        optimizer.zero_grad()
        
        # Convert w to adversarial example
        adv_images = 0.5 * (torch.tanh(w) + 1)
        
        # Get model output (logits)
        logits = model(adv_images)
        
        # L2 distance
        l2_dist = (adv_images - images).pow(2).sum(dim=[1,2,3]).sqrt()
        
        # Attack loss
        if targeted:
            # For targeted: we want target_logit - max_other_logit > kappa
            target_logits = logits[range(len(labels)), labels]
            other_logits = logits.clone()
            other_logits[range(len(labels)), labels] = -float('inf')
            max_other_logits = other_logits.max(1)[0]
            
            f_loss = torch.clamp(max_other_logits - target_logits + kappa, min=0)
        else:
            # For untargeted: we want max_other_logit - true_logit > kappa
            true_logits = logits[range(len(labels)), labels]
            other_logits = logits.clone()
            other_logits[range(len(labels)), labels] = -float('inf')
            max_other_logits = other_logits.max(1)[0]
            
            f_loss = torch.clamp(true_logits - max_other_logits + kappa, min=0)
        
        # Total loss: L2 + c * attack_loss
        loss = l2_dist.sum() + c * f_loss.sum()
        
        # Backpropagation
        loss.backward()
        optimizer.step()
        
        # Update best adversarial examples
        pred_labels = logits.argmax(1)
        
        if targeted:
            successful = (pred_labels == labels)
        else:
            successful = (pred_labels != labels)
        
        for i in range(len(images)):
            if successful[i] and l2_dist[i] < best_l2[i]:
                # Detach so the stored best examples do not keep the autograd graph alive
                best_l2[i] = l2_dist[i].detach()
                best_adv[i] = adv_images[i].detach()
        
        # Print progress
        if iteration % 100 == 0:
            success_rate = successful.float().mean()
            avg_l2 = l2_dist[successful].mean() if successful.any() else float('inf')
            print(f"Iter {iteration}: Success={success_rate:.2%}, Avg L2={avg_l2:.4f}")
    
    return best_adv

# Example usage with binary search for c
def cw_attack_with_binary_search(model, images, labels, targeted=True):
    """
    C&W attack with binary search for optimal c
    """
    # Binary search parameters
    c_low = 0
    c_high = 1
    num_binary_search = 9
    
    best_adv = images.clone()
    best_l2 = float('inf') * torch.ones(images.shape[0]).to(device)  # keep on the same device as l2_dist
    
    for search_iter in range(num_binary_search):
        c = (c_low + c_high) / 2
        
        print(f"\n=== Binary Search Iteration {search_iter}, c={c:.4f} ===")
        
        adv_images = cw_l2_attack(
            model, images, labels,
            targeted=targeted,
            c=c,
            max_iter=1000
        )
        
        # Check success
        with torch.no_grad():
            logits = model(adv_images)
            pred = logits.argmax(1)
            
            if targeted:
                successful = (pred == labels)
            else:
                successful = (pred != labels)
            
            l2_dist = (adv_images - images).pow(2).sum(dim=[1,2,3]).sqrt()
        
        # Update binary search bounds
        if successful.all():
            c_high = c
            # Update best examples
            for i in range(len(images)):
                if l2_dist[i] < best_l2[i]:
                    best_l2[i] = l2_dist[i]
                    best_adv[i] = adv_images[i]
        else:
            c_low = c
    
    print(f"\nFinal Results:")
    print(f"Average L2 distance: {best_l2.mean():.4f}")
    
    return best_adv

# Test C&W
test_images, test_labels = next(iter(test_loader))
test_images = test_images.to(device)

# For targeted attack, choose random target labels
target_labels = torch.randint(0, 10, (len(test_labels),)).to(device)

adv_images = cw_attack_with_binary_search(
    model, test_images, target_labels, targeted=True
)

C&W Attack Variants

1. C&W L0 (Sparse Perturbations)

  • Minimizes number of pixels changed
  • Uses iterative pixel selection
  • Useful for understanding minimal attack requirements

2. C&W L2 (Most Common)

  • Minimizes Euclidean distance
  • Balanced imperceptibility
  • What we implemented above

3. C&W L∞ (Max Change)

  • Minimizes maximum per-pixel change
  • Comparable to PGD but more optimized
  • Uses different optimization strategy

Comparison Summary

| Attack | Speed     | Success Rate | Perturbation Size | Use Case            |
|--------|-----------|--------------|-------------------|---------------------|
| FGSM   | Very fast | Moderate     | Large             | Quick testing       |
| PGD    | Fast      | High         | Medium            | Standard evaluation |
| C&W    | Slow      | Very high    | Minimal           | Best-case attack    |

When to Use Each:

  • FGSM: Initial robustness testing, computational constraints
  • PGD: Standard adversarial training and evaluation
  • C&W: Publication-quality attacks, minimal perturbations, breaking defenses

5. Transferability of Adversarial Examples

Duration: 20 minutes

5.1 The Transferability Phenomenon

Definition: An adversarial example crafted for one model often transfers to other models, even if they have different architectures or were trained on different data.

This is surprising because:

  • Models have different architectures
  • Different random initializations
  • Different training procedures
  • Yet they share similar vulnerabilities

Discovery: Szegedy et al. (2013) first observed that adversarial examples generated for one neural network often fool other networks.

5.2 Why Does Transferability Occur?

Hypotheses:

  1. Shared Decision Boundaries
    • Different models learn similar decision boundaries
    • All models try to approximate the same underlying data distribution
    • Adversarial examples exploit geometry of the data manifold
  2. Linear Approximation
    • Models are locally linear in high dimensions
    • Similar linear approximations across models
    • Perturbations that fool one linear region transfer to others
  3. Gradient Masking
    • Some defenses hide gradients without fixing vulnerabilities
    • Adversarial examples still transfer despite obfuscated gradients
    • Reveals that defense is incomplete
  4. Shared Training Data
    • Models trained on similar data learn similar features
    • Common vulnerabilities in learned representations
    • Transfer more likely between models from same domain

5.3 Factors Affecting Transferability

Model Similarity

High Transfer Probability:

  • Same architecture family (e.g., ResNet-18 → ResNet-50)
  • Similar training data
  • Similar preprocessing
  • Same task/domain

Low Transfer Probability:

  • Very different architectures (CNN → Transformer)
  • Different tasks (ImageNet → medical images)
  • Different modalities (vision → audio)

Attack Method

Transfer Success Rates (Typical):

| Attack Method  | Same Architecture | Different Architecture |
|----------------|-------------------|------------------------|
| FGSM           | ~60%              | ~30%                   |
| PGD (10 iter)  | ~80%              | ~50%                   |
| PGD (100 iter) | ~90%              | ~60%                   |
| C&W            | ~95%              | ~70%                   |

Observations:

  • Stronger attacks (more optimization) transfer better
  • Iterative methods > single-step methods
  • Ensemble attacks transfer best

5.4 Ensemble-based Attacks

Strategy: Generate adversarial examples that fool multiple models simultaneously.

Algorithm:

1. Train/collect N different models: {M₁, M₂, ..., Mₙ}
2. Compute ensemble loss:
   L_ensemble = Σᵢ wᵢ · L(Mᵢ(x), y)
3. Optimize adversarial example against ensemble
4. Result transfers well to unseen models

Code Demo: Ensemble Transfer Attack

def ensemble_attack(models, images, labels, epsilon=0.3, alpha=0.01, num_iter=40, weights=None):
    """
    Generate adversarial examples using ensemble of models
    
    Args:
        models: List of models
        images: Input images
        labels: True labels
        epsilon: Perturbation budget
        alpha: Step size
        num_iter: Number of iterations
        weights: Ensemble weights (uniform if None)
    
    Returns:
        Adversarial examples optimized for ensemble
    """
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    
    images = images.clone().detach().to(device)
    labels = labels.clone().detach().to(device)
    
    # Initialize
    adv_images = images.clone()
    adv_images.requires_grad = True
    
    for i in range(num_iter):
        # Compute ensemble loss
        total_loss = 0
        
        for model, weight in zip(models, weights):
            outputs = model(adv_images)
            loss = F.cross_entropy(outputs, labels)
            total_loss += weight * loss
        
        # Backward pass
        for model in models:
            model.zero_grad()
        total_loss.backward()
        
        # Update
        grad = adv_images.grad.data
        adv_images = adv_images.detach() + alpha * grad.sign()
        
        # Project
        delta = torch.clamp(adv_images - images, -epsilon, epsilon)
        adv_images = images + delta
        adv_images = torch.clamp(adv_images, 0, 1)
        
        adv_images.requires_grad = True
    
    return adv_images.detach()

# Test ensemble attack
def test_ensemble_transfer():
    """
    Test transfer attack using ensemble
    """
    # Load multiple ImageNet-pretrained models for the ensemble
    # (test_loader must provide inputs/labels that match these models)
    models = [
        torchvision.models.resnet18(pretrained=True).to(device).eval(),
        torchvision.models.resnet34(pretrained=True).to(device).eval(),
        torchvision.models.vgg16(pretrained=True).to(device).eval(),
    ]
    
    # Victim model (not in ensemble)
    victim_model = torchvision.models.densenet121(pretrained=True).to(device).eval()
    
    test_images, test_labels = next(iter(test_loader))
    test_images, test_labels = test_images.to(device), test_labels.to(device)
    
    # Generate ensemble adversarial examples
    adv_images = ensemble_attack(models, test_images, test_labels)
    
    # Test on each ensemble model
    print("Transfer success on ensemble models:")
    for i, model in enumerate(models):
        with torch.no_grad():
            outputs = model(adv_images)
            preds = outputs.argmax(1)
            success = (preds != test_labels).float().mean()
            print(f"  Model {i+1}: {success:.2%}")
    
    # Test on victim model (KEY: this model wasn't in ensemble!)
    with torch.no_grad():
        outputs = victim_model(adv_images)
        preds = outputs.argmax(1)
        success = (preds != test_labels).float().mean()
        print(f"Victim Model (DenseNet): {success:.2%}")
    
    return adv_images

ensemble_adv = test_ensemble_transfer()

Typical Results:

  • Ensemble models: 90-95% attack success
  • Victim model: 60-80% attack success (impressive transfer!)

5.5 Practical Implications

For Attackers (Black-box Scenarios)

Attack Pipeline:

  1. Identify target system (e.g., face recognition API)
  2. Collect similar training data
  3. Train substitute models
  4. Generate ensemble adversarial examples
  5. Test on target system
  6. Success without ever seeing the target model!

Real Example:

  • Target: Google Cloud Vision API
  • Substitute: ImageNet-trained ResNets
  • Result: 70%+ transfer success rate

For Defenders

Security Implications:

  1. Security through obscurity doesn't work
    • Hiding model architecture provides little security
    • Attackers can use transfer attacks
  2. Need robust models, not hidden models
    • Adversarial training on diverse architectures
    • Ensemble defenses
    • Input preprocessing
  3. Detection opportunities
    • Transferred examples may be less optimized
    • Slightly larger perturbations
    • Potential for detection mechanisms

5.6 Experimental Activity

Student Exercise (15 minutes):

# TODO for students: Complete this function
def measure_transferability(source_model, target_models, attack_fn, test_loader):
    """
    Measure transfer rates between models
    
    Args:
        source_model: Model to generate adversarial examples on
        target_models: List of models to test transfer
        attack_fn: Function to generate adversarial examples
        test_loader: Test data
    
    Returns:
        Transfer matrix (success rates)
    """
    transfer_rates = []
    
    for images, labels in test_loader:
        # TODO: Generate adversarial examples on source_model
        # TODO: Test on each target_model
        # TODO: Calculate success rates
        pass
    
    return transfer_rates

# Test questions for students:
# 1. Which pairs of models have highest transfer?
# 2. Does attack strength (epsilon) affect transfer rate?
# 3. How does targeted vs untargeted affect transfer?

6. Physical Adversarial Examples

Duration: 20 minutes

6.1 The Challenge of Physical Attacks

Digital vs. Physical Attacks:

| Aspect               | Digital    | Physical                          |
|----------------------|------------|-----------------------------------|
| Perturbation Control | Exact      | Approximate                       |
| Environment          | Controlled | Variable                          |
| Transformations      | None       | Viewing angle, lighting, distance |
| Medium               | Pixels     | Physical objects                  |
| Persistence          | Temporary  | Permanent                         |

Why Physical Attacks Matter:

  • Real-world deployment scenarios (autonomous vehicles, security cameras)
  • Persistent threats (stickers, printed patterns)
  • Harder to detect and remove
  • Demonstrate practical security vulnerabilities

6.2 Physical World Challenges

Environmental Variations:

  1. Viewing Angle
    • Camera perspective changes
    • 3D to 2D projection
    • Occlusion and distortion
  2. Lighting Conditions
    • Shadows and highlights
    • Color shifts
    • Reflections and glare
  3. Distance
    • Resolution changes
    • Focus and blur
    • Scale variations
  4. Printing/Fabrication
    • Color gamut limitations
    • Material properties
    • Texture and finish

The Core Problem:

Digital perturbation → Physical medium → Camera capture → Model input
     (optimized)      (approximation)    (transformations)   (changed)

Adversarial example must survive this entire pipeline!

6.3 Expectation over Transformation (EOT)

Developed by: Athalye et al. (2018)
Key Insight: Optimize adversarial examples to be robust across transformations

Algorithm:

For each optimization iteration:
    1. Sample random transformation T ~ T_distribution
       (e.g., rotation, scaling, lighting change)
    2. Apply T to adversarial example
    3. Compute loss on transformed version
    4. Update perturbation based on expected loss

Mathematical Formulation:

maximize: E_{t~T}[ L(f(t(x + δ)), y_true) ]    subject to ||δ|| ≤ ε

Where:
- t ~ T: random transformation drawn from the distribution T (rotation, lighting, scale, etc.)
- E_{t~T}: expectation over that transformation distribution
- Maximizing the expected loss (untargeted case; for a targeted attack, minimize the expected loss toward the target class instead) makes the perturbation robust to the sampled transformations

Code Demo: EOT for Physical Robustness

import torchvision.transforms.functional as TF

def eot_attack(model, images, labels, epsilon=0.3, num_iter=100, num_samples=20):
    """
    Expectation over Transformation attack for physical robustness
    
    Args:
        model: Target model
        images: Input images
        labels: True labels
        epsilon: Perturbation budget
        num_iter: Number of optimization iterations
        num_samples: Number of transformations to sample per iteration
    
    Returns:
        Physically robust adversarial examples
    """
    images = images.clone().detach().to(device)
    labels = labels.clone().detach().to(device)
    
    # Initialize perturbation
    delta = torch.zeros_like(images)
    delta.requires_grad = True
    
    optimizer = torch.optim.Adam([delta], lr=0.01)
    
    for iteration in range(num_iter):
        optimizer.zero_grad()
        
        total_loss = 0
        
        # Sample multiple transformations
        for _ in range(num_samples):
            # Apply random transformations
            transformed = apply_random_transform(images + delta)
            
            # Untargeted attack: negate the classification loss so that
            # minimizing total_loss with Adam maximizes the true-label loss
            outputs = model(transformed)
            loss = -F.cross_entropy(outputs, labels)
            total_loss += loss
        
        # Average loss over transformations
        avg_loss = total_loss / num_samples
        
        # Backward pass
        avg_loss.backward()
        optimizer.step()
        
        # Project perturbation
        with torch.no_grad():
            delta.data = torch.clamp(delta.data, -epsilon, epsilon)
            delta.data = torch.clamp(images + delta.data, 0, 1) - images
    
    return (images + delta).detach()

def apply_random_transform(images):
    """
    Apply random transformations simulating physical world variations
    """
    batch_size = images.shape[0]
    transformed = []
    
    for i in range(batch_size):
        img = images[i]
        
        # Random rotation (-15 to +15 degrees)
        angle = torch.rand(1).item() * 30 - 15
        img = TF.rotate(img, angle)
        
        # Random brightness (0.8 to 1.2)
        brightness = torch.rand(1).item() * 0.4 + 0.8
        img = TF.adjust_brightness(img, brightness)
        
        # Random contrast (0.8 to 1.2)
        contrast = torch.rand(1).item() * 0.4 + 0.8
        img = TF.adjust_contrast(img, contrast)
        
        # Random scaling (0.9 to 1.1)
        scale = torch.rand(1).item() * 0.2 + 0.9
        h, w = img.shape[1], img.shape[2]
        new_h, new_w = int(h * scale), int(w * scale)
        img = TF.resize(img, (new_h, new_w))
        img = TF.center_crop(img, (h, w))
        
        transformed.append(img)
    
    return torch.stack(transformed)

# Test physical robustness
def test_physical_robustness():
    """
    Compare digital-only vs EOT attacks under transformations
    """
    test_images, test_labels = next(iter(test_loader))
    test_images, test_labels = test_images.to(device), test_labels.to(device)
    
    # Generate digital-only adversarial examples
    digital_adv = pgd_attack(model, test_images, test_labels)
    
    # Generate physically robust adversarial examples
    physical_adv = eot_attack(model, test_images, test_labels)
    
    # Test under various transformations
    num_tests = 50
    digital_success = []
    physical_success = []
    
    for _ in range(num_tests):
        # Apply random transformation
        digital_transformed = apply_random_transform(digital_adv)
        physical_transformed = apply_random_transform(physical_adv)
        
        with torch.no_grad():
            # Test digital adversarial
            outputs = model(digital_transformed)
            preds = outputs.argmax(1)
            digital_success.append((preds != test_labels).float().mean().item())
            
            # Test physical adversarial
            outputs = model(physical_transformed)
            preds = outputs.argmax(1)
            physical_success.append((preds != test_labels).float().mean().item())
    
    print(f"Digital-only attack success under transformations: {np.mean(digital_success):.2%}")
    print(f"EOT attack success under transformations: {np.mean(physical_success):.2%}")

test_physical_robustness()

6.4 Case Studies: Real-World Physical Attacks

Case Study 1: Adversarial Stop Signs

Research: Eykholt et al. (2018) - "Robust Physical-World Attacks on Deep Learning Visual Classification"

Attack Scenario:

  • Target: Traffic sign recognition in autonomous vehicles
  • Method: Black and white stickers on stop signs
  • Goal: Misclassify as speed limit or other signs

Results:

  • Success rate: 80%+ in physical world
  • Worked under various lighting and angles
  • Only needed to modify ~20% of sign area
  • Demonstrated serious autonomous vehicle vulnerability

Attack Process:

  1. Print adversarial patterns on stickers
  2. Place on stop sign in specific locations
  3. Attack survives camera capture and processing
  4. Model misclassifies sign

Defenses:

  • Multi-view verification
  • Temporal consistency (video frames)
  • Anomaly detection on sign appearance
  • Redundant sensing modalities

Case Study 2: Adversarial Eyeglasses

Research: Sharif et al. (2016) - "Accessorize to a Crime: Real and Stealthy Attacks on State-of-the-Art Face Recognition"

Attack Scenario:

  • Target: Face recognition systems
  • Method: Specially designed eyeglass frames
  • Goals:
    • Dodging: Avoid detection
    • Impersonation: Be recognized as someone else

Results:

  • Impersonation success: 100% in some cases
  • Dodging success: High
  • Physically realizable (can be fabricated)
  • Inconspicuous (looks like normal glasses)

Technical Approach:

  1. Optimize eyeglass frame pattern using EOT
  2. Account for different facial expressions and poses
  3. Print on actual glasses
  4. Test on commercial face recognition systems

Case Study 3: Adversarial Patches

Research: Brown et al. (2017) - "Adversarial Patch"

Attack Concept:

  • Small localized patch (can place anywhere in scene)
  • Causes misclassification when captured by camera
  • Independent of object location
  • Universal (one patch works for many images)

Example Applications:

def adversarial_patch_attack(model, patch_size=100, num_iter=1000):
    """
    Generate universal adversarial patch
    
    Args:
        model: Target model
        patch_size: Size of square patch
        num_iter: Optimization iterations
    
    Returns:
        Adversarial patch that can be applied anywhere
    """
    # Initialize random patch
    patch = torch.rand(3, patch_size, patch_size).to(device)
    patch.requires_grad = True
    
    optimizer = torch.optim.Adam([patch], lr=0.01)
    
    for iteration in range(num_iter):
        # Sample a batch of images (assumes a train_loader with shuffle=True,
        # so re-creating the iterator yields a different random batch each time)
        images, labels = next(iter(train_loader))
        images = images.to(device)
        
        # Apply patch at random locations
        patched_images = apply_patch_random_location(images, patch)
        
        # Optimize for targeted misclassification
        # (e.g., make everything classified as "toaster")
        target_class = 859  # toaster in ImageNet
        targets = torch.full_like(labels, target_class).to(device)
        
        outputs = model(patched_images)
        loss = F.cross_entropy(outputs, targets)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Clip patch to valid range
        with torch.no_grad():
            patch.data = torch.clamp(patch.data, 0, 1)
        
        if iteration % 100 == 0:
            acc = (outputs.argmax(1) == targets).float().mean()
            print(f"Iter {iteration}: Target accuracy = {acc:.2%}")
    
    return patch.detach()

def apply_patch_random_location(images, patch):
    """
    Apply patch at random location in each image
    """
    batch_size, c, h, w = images.shape
    p_h, p_w = patch.shape[1], patch.shape[2]
    
    patched = images.clone()
    
    for i in range(batch_size):
        # Random location
        x = torch.randint(0, w - p_w, (1,)).item()
        y = torch.randint(0, h - p_h, (1,)).item()
        
        # Apply patch
        patched[i, :, y:y+p_h, x:x+p_w] = patch
    
    return patched

Real-World Implications:

  • Attacker can print and place patch
  • Works regardless of scene composition
  • Very practical threat
  • Hard to defend (patch can be anywhere)

6.5 Defenses Against Physical Attacks

Challenges:

  • Physical attacks are harder to defend against
  • Transformations make detection difficult
  • Adversaries can iterate in physical world

Defense Strategies:

  1. Input Preprocessing
    • JPEG compression
    • Total variation minimization
    • Randomized smoothing
    • May reduce attack effectiveness (see the sketch after this list)
  2. Adversarial Training
    • Train on EOT-generated examples
    • Improves robustness to transformations
    • Computationally expensive
  3. Multi-Modal Sensing
    • Combine camera with lidar, radar
    • Harder to fool all modalities simultaneously
    • Common in autonomous vehicles
  4. Temporal Consistency
    • Check predictions across video frames
    • Physical objects should be consistent
    • Detect anomalous frame-to-frame changes
  5. Anomaly Detection
    • Detect unusual patterns (stickers, patches)
    • Shape and texture analysis
    • Machine learning for anomaly detection
  6. Certified Defenses
    • Randomized smoothing with provable guarantees
    • Can certify robustness to bounded perturbations
    • Active research area
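
As a concrete example of the input-preprocessing idea (item 1 above), a minimal sketch of JPEG re-encoding as a preprocessing step; the quality setting is illustrative, and this kind of defense weakens naive attacks but does not stop strong adaptive ones.

import io
import torch
from PIL import Image
from torchvision.transforms.functional import to_pil_image, to_tensor

def jpeg_preprocess(images, quality=75):
    """Re-encode each image as JPEG before classification to destroy high-frequency perturbations."""
    out = []
    for img in images.cpu():
        buffer = io.BytesIO()
        to_pil_image(img.clamp(0, 1)).save(buffer, format="JPEG", quality=quality)
        buffer.seek(0)
        out.append(to_tensor(Image.open(buffer)))
    return torch.stack(out).to(images.device)

# Usage: logits = model(jpeg_preprocess(adv_images))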

7. Wrap-up & Discussion

Duration: 10 minutes

7.1 Key Takeaways

What We Learned:

  1. Adversarial Examples are Real
    • Neural networks are fundamentally vulnerable
    • Small perturbations cause dramatic failures
    • Both theoretical and practical threat
  2. Attack Taxonomy
    • White-box (FGSM, PGD, C&W) for strongest attacks
    • Black-box (transfer, query-based) for realistic scenarios
    • Physical attacks for real-world deployment
  3. Transferability is Powerful
    • Adversarial examples transfer across models
    • Enables black-box attacks
    • Security through obscurity fails
  4. Physical Attacks are Practical
    • Real-world demonstrations exist
    • EOT makes attacks robust to transformations
    • Serious implications for deployed systems

7.2 Critical Thinking Questions

Discussion Topics:

  1. Fundamental Question:
    • Are adversarial examples a bug or a feature of machine learning?
    • Can we ever fully eliminate them?
  2. Ethical Considerations:
    • Should researchers publish adversarial attack methods?
    • How to balance security research with potential misuse?
  3. Real-World Deployment:
    • What systems are most at risk?
    • How should organizations respond?
  4. Defense vs. Attack:
    • Is this an arms race with no end?
    • What's the path forward?

7.3 Looking Ahead

Next Week: Data Poisoning & Backdoor Attacks

  • Training-time attacks
  • How attackers can compromise models before deployment
  • Trojan behaviors in neural networks

Connections:

  • Today: Test-time evasion attacks
  • Next week: Training-time poisoning attacks
  • Together: Complete picture of adversarial ML threats

7.4 Assignment Preview

Homework 3: Implementing Adversarial Attacks

Due: Date

Tasks:

  1. Implement FGSM and PGD on CIFAR-10
  2. Evaluate transferability between architectures
  3. Experiment with EOT for robustness
  4. Written report on findings

Rubric:

  • Implementation correctness (40%)
  • Experimental methodology (30%)
  • Analysis and insights (20%)
  • Code quality and documentation (10%)

Starter Code: Will be posted on Canvas

7.5 Resources for Further Study

Seminal Papers:

  1. Szegedy et al. (2013) - "Intriguing properties of neural networks"
  2. Goodfellow et al. (2014) - "Explaining and Harnessing Adversarial Examples"
  3. Madry et al. (2017) - "Towards Deep Learning Models Resistant to Adversarial Attacks"
  4. Carlini & Wagner (2017) - "Towards Evaluating the Robustness of Neural Networks"

Tutorials and Surveys:

  • Adversarial Robustness Toolbox (ART) - IBM
  • CleverHans Library - Google Brain
  • Adversarial ML Reading List - Nicholas Carlini

Online Resources:

  • OpenAI Blog on Adversarial Examples
  • Google AI Blog - Security & Privacy
  • NIST Adversarial ML Framework

Appendix: Code Repositories

Complete Implementation: All code from today's demos is available in the course repository:

/course-materials/week3-adversarial-attacks/
├── fgsm.py
├── pgd.py
├── cw_attack.py
├── eot_physical.py
├── transfer_experiments.py
└── visualization_utils.py

Dependencies:

pip install torch torchvision matplotlib numpy

Quick Start:

git clone [course-repo-url]
cd week3-adversarial-attacks
python fgsm.py --epsilon 0.3 --model resnet18

Questions?

Office Hours: Tuesday/Thursday, 1:00-3:30 PM (Zoom)
Email: zhengxiong.li@ucdenver.edu
Discussion Forum: Canvas

Remember: The best way to understand adversarial attacks is to implement them yourself!


End of Week 3 Tutorial