Week 3: Evasion Attacks & Adversarial Examples
CSCI 5773 - Introduction to Emerging Systems Security
Duration: 140-150 minutes
Module: Adversarial Machine Learning
Instructor: Dr. Zhengxiong Li
Table of Contents
- Introduction & Motivation (15 min)
- Fundamentals of Adversarial Machine Learning (20 min)
- White-box vs. Black-box Attacks (15 min)
- Attack Methods: FGSM, PGD, C&W (50 min)
- Transferability of Adversarial Examples (20 min)
- Physical Adversarial Examples (20 min)
- Wrap-up & Discussion (10 min)
1. Introduction & Motivation
Duration: 15 minutes
1.1 The Adversarial Machine Learning Problem
Opening Question: What happens when a machine learning model sees something it wasn't designed to handle?
Imagine a self-driving car's vision system that classifies a stop sign as a speed limit sign because someone placed a few carefully designed stickers on it. Or a spam filter that fails to detect malicious emails because attackers subtly modified their wording. These are examples of adversarial attacks on machine learning systems.
1.2 Historical Context
The Szegedy et al. Discovery (2013)
- Researchers at Google discovered that adding imperceptible perturbations to images could fool neural networks
- The now-famous illustration comes from the follow-up work by Goodfellow et al. (2014): a panda image + carefully crafted noise = classified as "gibbon" with 99.3% confidence
- The perturbations are so small that humans cannot detect them
- Key insight: Neural networks are vulnerable to inputs specifically designed to exploit their decision boundaries
1.3 Why This Matters
Real-World Security Implications:
- Autonomous Vehicles
- Misclassifying traffic signs, pedestrians, or obstacles
- Potential for accidents or unauthorized access
- Biometric Authentication
- Fooling face recognition systems
- Bypassing fingerprint or iris scanners
- Malware Detection
- Evading antivirus and intrusion detection systems
- Polymorphic malware that adapts to avoid detection
- Content Moderation
- Bypassing filters for hate speech, misinformation, or illegal content
- Automated censorship evasion
- Medical Diagnosis
- Misclassifying medical images (X-rays, MRIs)
- Potential for incorrect diagnoses and treatments
1.4 Learning Objectives for Today
By the end of this session, you will be able to:
- ✅ Understand the fundamental concepts of adversarial machine learning
- ✅ Distinguish between white-box and black-box attack scenarios
- ✅ Implement basic adversarial attacks (FGSM, PGD, C&W)
- ✅ Evaluate the transferability of adversarial examples across models
- ✅ Recognize the challenges of physical-world adversarial attacks
- ✅ Assess model robustness against adversarial inputs
2. Fundamentals of Adversarial Machine Learning
Duration: 20 minutes
2.1 What is an Adversarial Example?
Definition: An adversarial example is a specially crafted input designed to cause a machine learning model to make a mistake, typically by adding small, carefully calculated perturbations to legitimate inputs.
Mathematical Formulation:
Given:
- A classifier function f(x) = y
- An original input x with true label y_true
- A perturbation δ (delta)
An adversarial example is: x_adv = x + δ
Where:
- f(x_adv) ≠ y_true (misclassification occurs)
- ||δ|| < ε (the perturbation is small, typically constrained by a budget epsilon)
2.2 Types of Adversarial Attacks
By Attack Goal:
- Untargeted Attacks
- Goal: Cause any misclassification
- Example: Make a "cat" image classified as anything except "cat"
- Easier to achieve
- Targeted Attacks
- Goal: Cause misclassification to a specific target class
- Example: Make a "cat" image classified specifically as "dog"
- More challenging, requires more sophisticated perturbations
By Perturbation Constraints (each norm is illustrated in the short sketch after this list):
- L∞ (L-infinity) Norm
- Limits maximum change to any single pixel
- ||δ||∞ ≤ ε means no pixel changes by more than ε
- Most commonly used in research
- Example: ε = 0.03 on a [0, 1] scale
- L2 (Euclidean) Norm
- Limits total perturbation energy
- ||δ||₂ ≤ ε constrains the Euclidean distance
- Better represents overall image distortion
- L0 Norm
- Limits number of pixels that can be changed
- Sparse perturbations
- Example: Modify only 10% of pixels
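To make the three constraints concrete, here is a small NumPy sketch (using a hypothetical random image and perturbation) that measures each norm:

```python
import numpy as np

# Hypothetical example: a 32x32 RGB image in [0, 1] and a small random perturbation.
x = np.random.rand(32, 32, 3).astype(np.float32)
delta = np.random.uniform(-0.03, 0.03, x.shape).astype(np.float32)

l_inf = np.abs(delta).max()          # L-infinity: largest change to any single component
l_2 = np.sqrt((delta ** 2).sum())    # L2: Euclidean length of the perturbation
l_0 = (delta != 0).sum()             # L0: number of modified components

print(f"L-inf = {l_inf:.4f}, L2 = {l_2:.4f}, L0 = {l_0} of {delta.size} components")
```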
2.3 The Threat Model
Understanding adversarial attacks requires defining the threat model - what the attacker knows and can do.
Key Dimensions:
- Knowledge of the Model
- White-box: Full access to model architecture, weights, and training data
- Black-box: Only query access (input/output)
- Gray-box: Partial knowledge (e.g., architecture but not weights)
- Attack Capability
- Test-time attacks: Modify inputs at inference
- Training-time attacks: Poison training data (covered in Week 4)
- Physical Constraints
- Digital attacks: Direct pixel manipulation
- Physical attacks: Real-world modifications (stickers, lighting, etc.)
2.4 Why Are Neural Networks Vulnerable?
Key Factors:
- High Dimensionality
- Images have thousands/millions of dimensions
- Small changes in many dimensions accumulate
- Linear Nature
- Despite non-linear activations, many models are locally linear
- Small perturbations propagate linearly through layers
- Overconfidence
- Models make confident predictions even far from training distribution
- No built-in uncertainty quantification
- Decision Boundary Proximity
- Natural images often lie close to decision boundaries
- Easy to push them across with small perturbations
Visual Analogy:
The perturbation δ pushes the input across the decision boundary.
3. White-box vs. Black-box Attacks
Duration: 15 minutes
3.1 White-box Attacks
Definition: The attacker has complete knowledge of the target model.
Attacker's Knowledge:
- ✅ Model architecture (layers, activations, etc.)
- ✅ Model parameters (weights and biases)
- ✅ Training procedure and hyperparameters
- ✅ Training data distribution (sometimes)
- ✅ Gradient information
Advantages:
- Can compute exact gradients
- Most effective attacks possible
- Theoretical worst-case scenario
- Useful for robustness testing
Attack Strategy:
- Use gradient-based optimization
- Leverage backpropagation
- Direct computation of optimal perturbations
Example Scenario: a researcher red-teaming their own deployed model, or an attacker who has obtained an open-source or leaked copy of the model, can compute exact gradients and craft near-optimal perturbations entirely offline.
3.2 Black-box Attacks
Definition: The attacker has no knowledge of model internals, only query access.
Attacker's Knowledge:
- ✅ Input format
- ✅ Output format (labels, probabilities)
- ❌ Model architecture
- ❌ Model parameters
- ❌ Gradient information
Attack Approaches:
- Query-based Attacks
- Submit inputs and observe outputs
- Estimate gradients through finite differences
- Requires many queries (can be expensive/detectable)
- Transfer-based Attacks
- Train a substitute model on similar data
- Generate adversarial examples on substitute
- Transfer them to target model
- Exploits transferability property
- Decision-based Attacks
- Only observe final decisions (no probabilities)
- Use boundary-following techniques
- Requires even more queries
Example Scenario: an attacker repeatedly queries a commercial image-classification API, observes only the returned labels and scores, and uses them to estimate gradients or to train a substitute model; a sketch of the gradient-estimation step follows.
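A minimal sketch of zeroth-order (finite-difference) gradient estimation, assuming a hypothetical `query_model` function that returns a scalar loss or score for any input the attacker submits:

```python
import numpy as np

def estimate_gradient(query_model, x, num_coords=100, h=1e-3):
    """Central-difference estimate of d(loss)/d(x) using only black-box queries.

    query_model(x) is a hypothetical stand-in for the target API and is assumed
    to return a scalar loss/score; only a random subset of coordinates is probed
    to keep the query count manageable.
    """
    grad = np.zeros(x.size, dtype=np.float64)
    flat = x.reshape(-1).astype(np.float64)
    coords = np.random.choice(flat.size, size=min(num_coords, flat.size), replace=False)
    for i in coords:
        e = np.zeros_like(flat)
        e[i] = h
        # Two queries per probed coordinate (central difference).
        grad[i] = (query_model((flat + e).reshape(x.shape)) -
                   query_model((flat - e).reshape(x.shape))) / (2 * h)
    return grad.reshape(x.shape)
```

With an estimated gradient in hand, the attacker can plug it into the same update rules used by the white-box attacks in Section 4, at the cost of many queries per example.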
3.3 Comparison Table
| Aspect | White-box | Black-box |
|---|---|---|
| Model Access | Complete | Query only |
| Gradient Info | Available | Must estimate |
| Attack Success Rate | Highest | Lower |
| Queries Required | Few | Many |
| Realism | Lower (rarely have full access) | Higher |
| Detection Risk | Lower | Higher (many queries) |
| Computation | Efficient | Expensive |
3.4 Gray-box Attacks (Intermediate)
Definition: Partial knowledge of the model.
Common Scenarios:
- Know architecture but not weights (e.g., public model types)
- Have access to similar models from same vendor
- Know training methodology but not exact data
Strategy:
- Use architecture knowledge to build substitute
- Fine-tune on available data
- Apply transfer attacks
4. Attack Methods: FGSM, PGD, C&W
Duration: 50 minutes
4.1 Fast Gradient Sign Method (FGSM)
Duration: 15 minutes
Developed by: Ian Goodfellow et al. (2014)
Type: White-box, single-step attack
Complexity: Low (fast and simple)
The Core Idea
FGSM linearizes the loss function around the current input and takes a single step in the direction that maximizes loss.
Mathematical Formulation:
x_adv = x + ε · sign(∇_x L(f(x), y_true))
Where:
- x: Original input
- ε: Perturbation magnitude (typically 0.01 to 0.3)
- ∇_x L(f(x), y_true): Gradient of the loss w.r.t. the input
- sign(): Takes the sign of each gradient component (+1, 0, or -1)
- y_true: True label (for untargeted attacks) or target label (for targeted attacks)
Intuition
Think of the loss function as a hill:
- For untargeted attacks: Climb the hill (increase loss) → misclassification
- For targeted attacks: Go downhill toward target class
The sign() function ensures we move the same distance (ε) in each dimension, regardless of gradient magnitude.
Step-by-Step Process
- Forward pass: Compute model prediction on original input
- Compute loss: Calculate loss w.r.t. true/target label
- Backward pass: Compute gradient of loss w.r.t. input
- Generate perturbation: Take sign of gradient, multiply by ε
- Create adversarial example: Add perturbation to original input
- Clip values: Ensure pixels remain in the valid range [0, 1]
Code Demo: FGSM Implementation
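Below is a minimal PyTorch sketch of FGSM, not a canonical implementation: `model` is assumed to be any classifier returning logits, inputs are assumed to lie in [0, 1], and the `targeted` flag implements the targeted variant discussed in the next subsection.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03, targeted=False):
    """Single-step FGSM. x: input batch in [0, 1]; y: true labels
    (or target labels when targeted=True)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    # Untargeted: step uphill on the loss; targeted: step downhill toward the target label.
    step = -epsilon if targeted else epsilon
    return (x_adv + step * grad.sign()).clamp(0.0, 1.0).detach()
```

Usage would look like `adv = fgsm_attack(model, images, labels, epsilon=8/255)` for a batch already normalized to [0, 1].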
Strengths and Limitations
Strengths:
- ✅ Extremely fast (single gradient computation)
- ✅ Simple to implement
- ✅ Good for testing baseline robustness
- ✅ Works well for small ε values
Limitations:
- ❌ Single-step method (not optimal)
- ❌ Lower success rate than iterative methods
- ❌ Can be easily defended against
- ❌ Limited transferability
Targeted FGSM Variant
For targeted attacks (forcing classification to a specific target class y_target), the gradient is taken with respect to the target label and the step is taken downhill instead of uphill: x_adv = x − ε · sign(∇_x L(f(x), y_target)). The `targeted` flag in the FGSM sketch above implements exactly this sign flip.
4.2 Projected Gradient Descent (PGD)
Duration: 20 minutes
Developed by: Madry et al. (2017)
Type: White-box, iterative attack
Complexity: Medium (multiple iterations)
The Core Idea
PGD is an iterative version of FGSM that takes multiple smaller steps and projects back onto the allowed perturbation space. It's considered one of the strongest first-order adversarial attacks.
Why Iterative?
- Single-step FGSM is suboptimal
- Multiple small steps find better adversarial examples
- Can escape local minima
- Achieves higher attack success rates
Mathematical Formulation:
x_adv^(0) = x + random noise within the ε-ball
x_adv^(t+1) = Π_{x+S}( x_adv^(t) + α · sign(∇_x L(f(x_adv^(t)), y_true)) )
Where:
- T: Number of iterations (typically 10-100)
- α: Step size (typically ε/T or smaller)
- Π_{x+S}: Projection operator that clips back to the allowed space
- S: Allowed perturbation region (L∞ ball of radius ε)
- Random initialization helps escape local minima
Key Components
- Random Start
- Initialize within ε-ball around original input
- Helps find stronger adversarial examples
- Prevents getting stuck in local optima
- Iterative Updates
- Take multiple gradient steps
- Each step smaller than FGSM (α << ε)
- More thorough exploration of loss landscape
- Projection
- After each step, project back to allowed region
- Ensures ||x_adv - x||∞ ≤ ε
- Maintains perturbation constraint
Projection Operator Explained
The projection ensures we stay within the ε-ball: for the L∞ constraint it is simple element-wise clipping, x_adv ← clip(x_adv, x − ε, x + ε), followed by clipping to the valid pixel range [0, 1].
Code Demo: PGD Implementation
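A minimal PyTorch sketch of L∞-constrained PGD with random start, under the same assumptions as the FGSM sketch (logit-returning `model`, inputs in [0, 1]):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon=0.03, alpha=0.007, num_iter=40):
    """Iterative L-infinity PGD with a random start inside the epsilon-ball."""
    # Random initialization within the epsilon-ball around the clean input.
    x_adv = (x + torch.empty_like(x).uniform_(-epsilon, epsilon)).clamp(0.0, 1.0).detach()

    for _ in range(num_iter):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Gradient-sign step, then project back onto the epsilon-ball and valid range.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)  # L-inf projection
        x_adv = x_adv.clamp(0.0, 1.0)

    return x_adv.detach()
```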
PGD Variants
- PGD-∞ (L-infinity constrained)
- What we've described above
- Most common in research
- Constrains maximum per-pixel change
- PGD-2 (L2 constrained)
- Constrains total Euclidean distance
- Different projection operator
- PGD with Momentum
- Accumulates gradient history
- Helps escape local minima
- Similar to momentum in optimization
PGD as Universal Attack Standard
PGD is often considered the gold standard for evaluating adversarial robustness:
- Strong enough to find most vulnerabilities
- Computationally tractable
- Well-understood theoretically
- Used in adversarial training (defense method)
Trade-offs:
| Aspect | FGSM | PGD |
|---|---|---|
| Speed | Very fast (1 iter) | Slower (40-100 iters) |
| Success Rate | Moderate | High |
| Perturbation Efficiency | Lower | Higher |
| Use Case | Quick testing | Thorough evaluation |
4.3 Carlini & Wagner (C&W) Attack
Duration: 15 minutes
Developed by: Nicholas Carlini and David Wagner (2017)
Type: White-box, optimization-based attack
Complexity: High (but very effective)
The Core Idea
C&W reformulates adversarial example generation as an optimization problem with a carefully designed loss function. Instead of following the gradient of the standard classification loss, C&W optimizes a custom objective that balances:
- Misclassification (attack success)
- Perturbation minimization (imperceptibility)
This produces minimal perturbations that reliably fool the model.
Mathematical Formulation
Optimization Problem:
minimize ||δ||_p + c · f(x + δ)   subject to   x + δ ∈ [0, 1]^n
Where:
- δ: Perturbation to be optimized
- ||δ||_p: Perturbation magnitude (L0, L2, or L∞ norm)
- c: Constant balancing perturbation size against attack success (chosen by binary search)
- f(): Objective function measuring attack success
The f() Function:
Instead of directly using the classification loss, C&W introduces:
f(x') = max( max{Z(x')_i : i ≠ t} − Z(x')_t, −κ )
Where:
- Z(x'): Logits (pre-softmax outputs) for input x'
- t: Target class
- κ (kappa): Confidence parameter
- max{Z(x')_i : i ≠ t}: Highest logit among the non-target classes
Intuition:
- When f(x') < 0, the attack succeeds (the target logit is highest)
- κ controls the confidence margin (how strongly we want the target class to win)
- c balances attack success against perturbation size
Why C&W is Different
Advantages over FGSM/PGD:
- Minimal Perturbations
- Finds smallest possible perturbation
- More realistic threat model
- Harder to detect
- High Success Rate
- Near 100% success on many models
- Works even against defensive distillation
- Very hard to defend against
- Confidence Control
- Can specify confidence of misclassification
- κ parameter ensures robust adversarial examples
- Different Norms
- L0: Minimizes number of changed pixels (sparse)
- L2: Minimizes Euclidean distance (most common)
- L∞: Minimizes maximum per-pixel change
Disadvantages:
- Computational Cost
- Much slower than FGSM/PGD
- Requires solving optimization problem per example
- Can take seconds to minutes per image
- Complexity
- More parameters to tune (c, κ, learning rate)
- Requires careful initialization
- Binary search for optimal c
Change of Variables Trick
To enforce x + δ ∈ [0, 1], C&W uses a clever change of variables:
x_adv = ½ · (tanh(w) + 1)
Where w is the unconstrained optimization variable. This ensures:
- tanh(w) ∈ [-1, 1]
- x_adv ∈ [0, 1] automatically
- No need for explicit clipping
The attack then optimizes over w directly; for the L2 variant the objective becomes: minimize ||½(tanh(w) + 1) − x||₂² + c · f(½(tanh(w) + 1)).
Code Demo: C&W L2 Attack
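A simplified PyTorch sketch of the targeted C&W L2 attack using the tanh change of variables. For brevity it uses a fixed `c` rather than the binary search from the full attack, and again assumes a logit-returning `model` with inputs in [0, 1]:

```python
import torch

def cw_l2_attack(model, x, target, c=1.0, kappa=0.0, steps=1000, lr=0.01):
    """Simplified targeted C&W L2 attack (fixed c, no binary search).
    x: batch in [0, 1]; target: desired target labels."""
    # Change of variables: x_adv = 0.5 * (tanh(w) + 1) is always in [0, 1].
    w = torch.atanh((2 * x - 1).clamp(-0.999999, 0.999999)).detach().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        x_adv = 0.5 * (torch.tanh(w) + 1)
        logits = model(x_adv)

        # f(x') = max(max_{i != t} Z_i - Z_t, -kappa): negative once the target class wins.
        target_logit = logits.gather(1, target.unsqueeze(1)).squeeze(1)
        masked = logits.clone()
        masked.scatter_(1, target.unsqueeze(1), float('-inf'))
        f_loss = torch.clamp(masked.max(dim=1).values - target_logit, min=-kappa)

        l2_dist = ((x_adv - x) ** 2).flatten(1).sum(dim=1)  # squared L2 distance per example
        loss = (l2_dist + c * f_loss).sum()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return (0.5 * (torch.tanh(w) + 1)).detach()
```

In the full attack, `c` is tuned per example by binary search to find the smallest value that still produces a successful attack.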
C&W Attack Variants
1. C&W L0 (Sparse Perturbations)
- Minimizes number of pixels changed
- Uses iterative pixel selection
- Useful for understanding minimal attack requirements
2. C&W L2 (Most Common)
- Minimizes Euclidean distance
- Balanced imperceptibility
- What we implemented above
3. C&W L∞ (Max Change)
- Minimizes maximum per-pixel change
- Comparable to PGD but more optimized
- Uses different optimization strategy
Comparison Summary
| Attack | Speed | Success Rate | Perturbation Size | Use Case |
|---|---|---|---|---|
| FGSM | Very Fast | Moderate | Large | Quick testing |
| PGD | Fast | High | Medium | Standard evaluation |
| C&W | Slow | Very High | Minimal | Best-case attack |
When to Use Each:
- FGSM: Initial robustness testing, computational constraints
- PGD: Standard adversarial training and evaluation
- C&W: Publication-quality attacks, minimal perturbations, breaking defenses
5. Transferability of Adversarial Examples
Duration: 20 minutes
5.1 The Transferability Phenomenon
Definition: An adversarial example crafted for one model often transfers to other models, even if they have different architectures or were trained on different data.
This is surprising because:
- Models have different architectures
- Different random initializations
- Different training procedures
- Yet they share similar vulnerabilities
Discovery: Szegedy et al. (2013) first observed that adversarial examples generated for one neural network often fool other networks.
5.2 Why Does Transferability Occur?
Hypotheses:
- Shared Decision Boundaries
- Different models learn similar decision boundaries
- All models try to approximate the same underlying data distribution
- Adversarial examples exploit geometry of the data manifold
- Linear Approximation
- Models are locally linear in high dimensions
- Similar linear approximations across models
- Perturbations that fool one linear region transfer to others
- Gradient Masking
- Some defenses hide gradients without fixing vulnerabilities
- Adversarial examples still transfer despite obfuscated gradients
- Reveals that defense is incomplete
- Shared Training Data
- Models trained on similar data learn similar features
- Common vulnerabilities in learned representations
- Transfer more likely between models from same domain
5.3 Factors Affecting Transferability
Model Similarity
High Transfer Probability:
- Same architecture family (e.g., ResNet-18 → ResNet-50)
- Similar training data
- Similar preprocessing
- Same task/domain
Low Transfer Probability:
- Very different architectures (CNN → Transformer)
- Different tasks (ImageNet → medical images)
- Different modalities (vision → audio)
Attack Method
Transfer Success Rates (Typical):
| Attack Method | Same Architecture | Different Architecture |
|---|---|---|
| FGSM | ~60% | ~30% |
| PGD (10 iter) | ~80% | ~50% |
| PGD (100 iter) | ~90% | ~60% |
| C&W | ~95% | ~70% |
Observations:
- Stronger attacks (more optimization) transfer better
- Iterative methods > single-step methods
- Ensemble attacks transfer best
5.4 Ensemble-based Attacks
Strategy: Generate adversarial examples that fool multiple models simultaneously.
Algorithm: at each iteration, compute the attack loss on every surrogate model, average the losses, and take a PGD-style step on that averaged loss; the resulting example is then transferred to the unseen victim model.
Code Demo: Ensemble Transfer Attack
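The sketch below illustrates the idea in PyTorch under the same assumptions as the earlier demos: PGD is run against the averaged loss of several surrogate models, and the result is evaluated on a held-out victim model. The surrogate and victim models are placeholders for whatever the attacker and defender actually use.

```python
import torch
import torch.nn.functional as F

def ensemble_pgd(models, x, y, epsilon=0.03, alpha=0.007, num_iter=40):
    """PGD against the average cross-entropy loss of an ensemble of surrogate models."""
    x_adv = (x + torch.empty_like(x).uniform_(-epsilon, epsilon)).clamp(0, 1).detach()
    for _ in range(num_iter):
        x_adv.requires_grad_(True)
        # Average the loss over all surrogates so the example fools them jointly.
        loss = sum(F.cross_entropy(m(x_adv), y) for m in models) / len(models)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon).clamp(0, 1)
    return x_adv.detach()

def transfer_success_rate(victim, x_adv, y):
    """Fraction of adversarial examples misclassified by the (unseen) victim model."""
    with torch.no_grad():
        preds = victim(x_adv).argmax(dim=1)
    return (preds != y).float().mean().item()
```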
Typical Results:
- Ensemble models: 90-95% attack success
- Victim model: 60-80% attack success (impressive transfer!)
5.5 Practical Implications
For Attackers (Black-box Scenarios)
Attack Pipeline:
- Identify target system (e.g., face recognition API)
- Collect similar training data
- Train substitute models
- Generate ensemble adversarial examples
- Test on target system
- Success without ever seeing target model!
Real Example:
- Target: Google Cloud Vision API
- Substitute: ImageNet-trained ResNets
- Result: 70%+ transfer success rate
For Defenders
Security Implications:
- Security through obscurity doesn't work
- Hiding model architecture provides little security
- Attackers can use transfer attacks
- Need robust models, not hidden models
- Adversarial training on diverse architectures
- Ensemble defenses
- Input preprocessing
- Detection opportunities
- Transferred examples may be less optimized
- Slightly larger perturbations
- Potential for detection mechanisms
5.6 Experimental Activity
Student Exercise (15 minutes):
6. Physical Adversarial Examples
Duration: 20 minutes
6.1 The Challenge of Physical Attacks
Digital vs. Physical Attacks:
| Aspect | Digital | Physical |
|---|---|---|
| Perturbation Control | Exact | Approximate |
| Environment | Controlled | Variable |
| Transformations | None | Viewing angle, lighting, distance |
| Medium | Pixels | Physical objects |
| Persistence | Temporary | Permanent |
Why Physical Attacks Matter:
- Real-world deployment scenarios (autonomous vehicles, security cameras)
- Persistent threats (stickers, printed patterns)
- Harder to detect and remove
- Demonstrate practical security vulnerabilities
6.2 Physical World Challenges
Environmental Variations:
- Viewing Angle
- Camera perspective changes
- 3D to 2D projection
- Occlusion and distortion
- Lighting Conditions
- Shadows and highlights
- Color shifts
- Reflections and glare
- Distance
- Resolution changes
- Focus and blur
- Scale variations
- Printing/Fabrication
- Color gamut limitations
- Material properties
- Texture and finish
The Core Problem:
The adversarial example must survive this entire pipeline: fabrication/printing → placement in the scene → viewing angle, lighting, and distance → camera capture → preprocessing → model input.
6.3 Expectation over Transformation (EOT)
Developed by: Athalye et al. (2018)
Key Insight: Optimize adversarial examples to be robust across transformations
Algorithm: at every optimization step, sample a batch of random transformations (rotation, scaling, lighting changes, noise, printing simulation), apply each to the current adversarial example, and take a gradient step on the average loss over the transformed copies.
Mathematical Formulation:
δ* = argmax_{||δ|| ≤ ε} E_{t∼T} [ L(f(t(x + δ)), y_true) ]
Where T is the distribution of expected physical-world transformations.
Code Demo: EOT for Physical Robustness
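A minimal PyTorch sketch of EOT under the assumptions above; the transformation sampler here (random rotation and brightness via torchvision) is a hypothetical stand-in for a full physical-world transformation distribution.

```python
import random
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def sample_transform():
    """Hypothetical transformation sampler: random rotation and brightness change,
    standing in for the full physical distribution (angle, lighting, scale, print noise...)."""
    angle = random.uniform(-15, 15)
    brightness = random.uniform(0.8, 1.2)
    return lambda img: TF.adjust_brightness(
        TF.rotate(img, angle, interpolation=InterpolationMode.BILINEAR), brightness)

def eot_attack(model, x, y, epsilon=0.1, alpha=0.01, num_iter=100, num_samples=10):
    """PGD-style attack on the expected loss over sampled transformations (EOT)."""
    x_adv = x.clone().detach()
    for _ in range(num_iter):
        x_adv.requires_grad_(True)
        # Monte Carlo estimate of E_t[ L(f(t(x_adv)), y) ].
        loss = sum(F.cross_entropy(model(sample_transform()(x_adv)), y)
                   for _ in range(num_samples)) / num_samples
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon).clamp(0, 1)
    return x_adv.detach()
```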
6.4 Case Studies: Real-World Physical Attacks
Case Study 1: Adversarial Stop Signs
Research: Eykholt et al. (2018) - "Robust Physical-World Attacks on Deep Learning Visual Classification"
Attack Scenario:
- Target: Traffic sign recognition in autonomous vehicles
- Method: Black and white stickers on stop signs
- Goal: Misclassify as speed limit or other signs
Results:
- Success rate: 80%+ in physical world
- Worked under various lighting and angles
- Only needed to modify ~20% of sign area
- Demonstrated serious autonomous vehicle vulnerability
Attack Process:
- Print adversarial patterns on stickers
- Place on stop sign in specific locations
- Attack survives camera capture and processing
- Model misclassifies sign
Defenses:
- Multi-view verification
- Temporal consistency (video frames)
- Anomaly detection on sign appearance
- Redundant sensing modalities
Case Study 2: Adversarial Eyeglasses
Research: Sharif et al. (2016) - "Accessory to the Crime: Real and Stealthy Attacks on State-of-the-Art Face Recognition"
Attack Scenario:
- Target: Face recognition systems
- Method: Specially designed eyeglass frames
- Goals:
- Dodging: Avoid detection
- Impersonation: Be recognized as someone else
Results:
- Impersonation success: 100% in some cases
- Dodging success: High
- Physically realizable (can be fabricated)
- Inconspicuous (looks like normal glasses)
Technical Approach:
- Optimize eyeglass frame pattern using EOT
- Account for different facial expressions and poses
- Print on actual glasses
- Test on commercial face recognition systems
Case Study 3: Adversarial Patches
Research: Brown et al. (2017) - "Adversarial Patch"
Attack Concept:
- Small localized patch (can place anywhere in scene)
- Causes misclassification when captured by camera
- Independent of object location
- Universal (one patch works for many images)
Example Applications: the original paper's printed patch reliably causes ImageNet classifiers to output "toaster" for almost any scene it is placed in; later work used similar patches to hide people from person detectors.
Real-World Implications:
- Attacker can print and place patch
- Works regardless of scene composition
- Very practical threat
- Hard to defend (patch can be anywhere)
6.5 Defenses Against Physical Attacks
Challenges:
- Physical attacks are harder to defend against
- Transformations make detection difficult
- Adversaries can iterate in physical world
Defense Strategies:
- Input Preprocessing
- JPEG compression
- Total variation minimization
- Randomized smoothing
- May reduce attack effectiveness
- Adversarial Training
- Train on EOT-generated examples
- Improves robustness to transformations
- Computationally expensive
- Multi-Modal Sensing
- Combine camera with lidar, radar
- Harder to fool all modalities simultaneously
- Common in autonomous vehicles
- Temporal Consistency
- Check predictions across video frames
- Physical objects should be consistent
- Detect anomalous frame-to-frame changes
- Anomaly Detection
- Detect unusual patterns (stickers, patches)
- Shape and texture analysis
- Machine learning for anomaly detection
- Certified Defenses
- Randomized smoothing with provable guarantees
- Can certify robustness to bounded perturbations
- Active research area
7. Wrap-up & Discussion
Duration: 10 minutes
7.1 Key Takeaways
What We Learned:
- Adversarial Examples are Real
- Neural networks are fundamentally vulnerable
- Small perturbations cause dramatic failures
- Both theoretical and practical threat
- Attack Taxonomy
- White-box (FGSM, PGD, C&W) for strongest attacks
- Black-box (transfer, query-based) for realistic scenarios
- Physical attacks for real-world deployment
- Transferability is Powerful
- Adversarial examples transfer across models
- Enables black-box attacks
- Security through obscurity fails
- Physical Attacks are Practical
- Real-world demonstrations exist
- EOT makes attacks robust to transformations
- Serious implications for deployed systems
7.2 Critical Thinking Questions
Discussion Topics:
- Fundamental Question:
- Are adversarial examples a bug or a feature of machine learning?
- Can we ever fully eliminate them?
- Ethical Considerations:
- Should researchers publish adversarial attack methods?
- How to balance security research with potential misuse?
- Real-World Deployment:
- What systems are most at risk?
- How should organizations respond?
- Defense vs. Attack:
- Is this an arms race with no end?
- What's the path forward?
7.3 Looking Ahead
Next Week: Data Poisoning & Backdoor Attacks
- Training-time attacks
- How attackers can compromise models before deployment
- Trojan behaviors in neural networks
Connections:
- Today: Test-time evasion attacks
- Next week: Training-time poisoning attacks
- Together: Complete picture of adversarial ML threats
7.4 Assignment Preview
Homework 3: Implementing Adversarial Attacks
Due: Date
Tasks:
- Implement FGSM and PGD on CIFAR-10
- Evaluate transferability between architectures
- Experiment with EOT for robustness
- Written report on findings
Rubric:
- Implementation correctness (40%)
- Experimental methodology (30%)
- Analysis and insights (20%)
- Code quality and documentation (10%)
Starter Code: Will be posted on Canvas
7.5 Resources for Further Study
Seminal Papers:
- Szegedy et al. (2013) - "Intriguing properties of neural networks"
- Goodfellow et al. (2014) - "Explaining and Harnessing Adversarial Examples"
- Madry et al. (2017) - "Towards Deep Learning Models Resistant to Adversarial Attacks"
- Carlini & Wagner (2017) - "Towards Evaluating the Robustness of Neural Networks"
Tutorials and Surveys:
- Adversarial Robustness Toolbox (ART) - IBM
- CleverHans Library - Google Brain
- Adversarial ML Reading List - Nicholas Carlini
Online Resources:
- OpenAI Blog on Adversarial Examples
- Google AI Blog - Security & Privacy
- NIST Adversarial ML Framework
Appendix: Code Repositories
Complete Implementation: All code from today's demos is available in the course repository:
Dependencies:
Quick Start:
Questions?
Office Hours: Tuesday/Thursday, 1:00-3:30 PM (Zoom)
Email: zhengxiong.li@ucdenver.edu
Discussion Forum: Canvas
Remember: The best way to understand adversarial attacks is to implement them yourself!
End of Week 3 Tutorial