Week 5: Privacy Attacks: Model Inversion & Membership Inference

Module: Adversarial Machine Learning
Course: CSCI 5773 - Introduction to Emerging Systems Security
Duration: 140-150 minutes
Instructor: Dr. Zhengxiong Li


Learning Objectives

By the end of this session, students will be able to:

  1. Understand privacy risks in ML systems - Identify and explain the fundamental privacy threats that emerge when deploying machine learning models
  2. Implement membership inference attacks - Design and execute attacks that determine whether specific data points were used in training
  3. Apply differential privacy techniques - Implement privacy-preserving mechanisms to protect training data while maintaining model utility

Session Overview

Section | Topic                                     | Duration
--------|-------------------------------------------|---------
1       | Privacy Threats in Machine Learning       | 25 min
2       | Membership Inference Attacks              | 35 min
3       | Model Inversion and Extraction Attacks    | 30 min
4       | Differential Privacy Fundamentals         | 30 min
5       | Federated Learning Privacy Considerations | 20 min
6       | Wrap-up and Q&A                           | 10 min

Section 1: Privacy Threats in Machine Learning (25 minutes)

1.1 Introduction: Why Privacy Matters in ML

Machine learning models are not just mathematical functions—they are compressed representations of their training data. This fundamental property creates significant privacy risks that many practitioners overlook.

The Privacy Paradox in ML:

  • We want models that generalize well to new data
  • But models inevitably memorize aspects of their training data
  • This memorization creates an information leakage channel

1.2 The Machine Learning Privacy Threat Landscape

┌─────────────────────────────────────────────────────────────────────┐
│                    ML PRIVACY THREAT TAXONOMY                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐     │
│  │  TRAINING TIME  │  │ INFERENCE TIME  │  │   MODEL-LEVEL   │     │
│  ├─────────────────┤  ├─────────────────┤  ├─────────────────┤     │
│  │ • Data Poisoning│  │ • Membership    │  │ • Model         │     │
│  │ • Label Leakage │  │   Inference     │  │   Extraction    │     │
│  │ • Gradient      │  │ • Model         │  │ • Watermark     │     │
│  │   Leakage       │  │   Inversion     │  │   Removal       │     │
│  │ • Insider       │  │ • Attribute     │  │ • Fine-tuning   │     │
│  │   Threats       │  │   Inference     │  │   Attacks       │     │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘     │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

1.3 Understanding Information Leakage

Key Concept: Unintentional Memorization

Neural networks can memorize specific training examples, especially:

  • Rare or unique data points (outliers)
  • Data that appears multiple times
  • Data with distinctive patterns

Example: Credit Card Memorization in Language Models

Research has shown that large language models can memorize and reproduce sensitive information like credit card numbers if they appear in training data:

Training data: "My credit card number is 4532-1234-5678-9012"
                              ↓
                    Model Training
                              ↓
Query: "My credit card number is 4532-"
Model output: "1234-5678-9012"  ← Information leakage!

1.4 Categorizing Privacy Attacks

Attack Type | Attacker's Goal | Required Access | Example Scenario
------------|-----------------|-----------------|------------------
Membership Inference | Determine if specific data was in training set | Black-box (predictions only) | Was my medical record used to train this diagnostic model?
Model Inversion | Reconstruct training data features | Black-box or white-box | Recover faces from a facial recognition model
Attribute Inference | Infer sensitive attributes of training data | Black-box | Infer income level from credit model
Model Extraction | Steal the model itself | Black-box (queries) | Clone a proprietary fraud detection model
Training Data Extraction | Extract verbatim training examples | Black-box | Extract memorized text from LLMs

1.5 Demo: Visualizing Model Memorization

"""
Demo: Visualizing how models memorize training data
This example shows how model confidence differs between 
training data and unseen data
"""

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate synthetic dataset
X, y = make_classification(
    n_samples=1000, 
    n_features=20, 
    n_informative=10,
    n_redundant=5,
    random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a model that will overfit (demonstrating memorization)
model = MLPClassifier(
    hidden_layer_sizes=(100, 100, 100),  # Deep network
    max_iter=1000,
    random_state=42
)
model.fit(X_train, y_train)

# Get prediction confidences
train_probs = model.predict_proba(X_train)
test_probs = model.predict_proba(X_test)

# Extract confidence for the predicted class (top-1 probability)
train_confidence = np.max(train_probs, axis=1)
test_confidence = np.max(test_probs, axis=1)

print(f"Training set - Mean confidence: {train_confidence.mean():.4f}")
print(f"Test set - Mean confidence: {test_confidence.mean():.4f}")
print(f"Confidence gap: {train_confidence.mean() - test_confidence.mean():.4f}")

# This gap is what membership inference attacks exploit!

Expected Output:

Training set - Mean confidence: 0.9847
Test set - Mean confidence: 0.8923
Confidence gap: 0.0924

Key Insight: The confidence gap between training and test data is the fundamental vulnerability that membership inference attacks exploit. Models are systematically more confident about data they've seen during training.

1.6 Real-World Privacy Incidents

Case Study 1: Netflix Prize (2006-2009)

  • Netflix released "anonymized" movie ratings for a recommendation competition
  • Researchers de-anonymized users by cross-referencing with IMDb ratings
  • Revealed sensitive information: political preferences, sexual orientation
  • Lesson: Anonymization alone is insufficient for privacy

Case Study 2: Strava Heat Map (2018)

  • Fitness app published aggregate exercise data as heat maps
  • Military bases were revealed through running patterns
  • Soldier identities could be inferred from consistent routes
  • Lesson: Aggregate data can still leak individual information

Case Study 3: GPT-2/GPT-3 Training Data Extraction (2020-2021)

  • Researchers extracted memorized content from language models
  • Recovered personal information, code snippets, URLs
  • Lesson: Large models can memorize and leak training data

Section 2: Membership Inference Attacks (35 minutes)

2.1 What is Membership Inference?

Definition: A membership inference attack (MIA) determines whether a specific data record was used in the training dataset of a machine learning model.

┌─────────────────────────────────────────────────────────────┐
│              MEMBERSHIP INFERENCE ATTACK                     │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│   Attacker has:                                              │
│   • Access to target model (black-box or white-box)         │
│   • A specific data record x                                 │
│   • (Optional) Similar data distribution                     │
│                                                              │
│   Attacker wants to know:                                    │
│   • Was x ∈ Training Set?                                    │
│                                                              │
│   ┌─────────┐      Query(x)      ┌─────────────┐            │
│   │         │ ─────────────────► │   Target    │            │
│   │Attacker │                    │    Model    │            │
│   │         │ ◄───────────────── │             │            │
│   └─────────┘   Prediction(x)    └─────────────┘            │
│        │                                                     │
│        ▼                                                     │
│   ┌─────────────────────────────────────────┐               │
│   │  Attack Model determines:                │               │
│   │  x ∈ Training Set  OR  x ∉ Training Set │               │
│   └─────────────────────────────────────────┘               │
│                                                              │
└─────────────────────────────────────────────────────────────┘

2.2 Why Does Membership Inference Work?

The Overfitting Hypothesis:

Machine learning models behave differently on their training data compared to unseen data:

  1. Higher confidence: Models output higher prediction probabilities for training samples
  2. Lower loss: Training samples have lower loss values
  3. Different gradients: Gradient norms differ between members and non-members
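These differences can be measured directly. The sketch below (a minimal illustration reusing the overfit-MLP setup from the Section 1.5 demo; the helper `per_sample_loss` is our own naming) compares per-sample cross-entropy loss between members and non-members:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=42)

# Deliberately overfit, as in the Section 1.5 demo
model = MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=1000, random_state=42)
model.fit(X_tr, y_tr)

def per_sample_loss(model, X, y, eps=1e-12):
    """Cross-entropy loss of the true class for each individual sample."""
    probs = np.clip(model.predict_proba(X), eps, 1.0)
    return -np.log(probs[np.arange(len(y)), y])

member_loss = per_sample_loss(model, X_tr, y_tr)
nonmember_loss = per_sample_loss(model, X_te, y_te)

# Members (training samples) exhibit systematically lower loss
print(f"Member mean loss:     {member_loss.mean():.4f}")
print(f"Non-member mean loss: {nonmember_loss.mean():.4f}")
```

The lower-loss signal for members is exactly what the more advanced attacks in this section turn into a membership decision.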

2.3 Attack Taxonomy

Attack Type | Description | Complexity
------------|-------------|-----------
Threshold Attack | Simple confidence threshold | Low
Shadow Model Attack | Train attack model on shadow models | Medium
Label-Only Attack | Uses only predicted labels | Medium
Likelihood Ratio Attack | Statistical hypothesis testing | High
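The label-only row deserves a concrete picture. Its simplest form (often called the "gap attack") predicts "member" exactly when the model classifies a sample correctly, using no probabilities at all. A minimal sketch, reusing the digits setup from Section 2.4 below (`gap_attack` is our own helper name):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=42)

model = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500, random_state=42)
model.fit(X_tr, y_tr)

def gap_attack(model, X, y):
    """Label-only membership guess: member iff the prediction is correct."""
    return (model.predict(X) == y).astype(int)

tpr = gap_attack(model, X_tr, y_tr).mean()      # members correctly flagged
tnr = 1 - gap_attack(model, X_te, y_te).mean()  # non-members correctly flagged
print(f"Gap attack balanced accuracy: {(tpr + tnr) / 2:.4f}")
```

The balanced accuracy of this attack is (train accuracy + (1 - test accuracy)) / 2, so it directly quantifies the model's generalization gap.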

2.4 Threshold-Based Attack (Simple)

The simplest membership inference uses a confidence threshold:

"""
Demo: Simple Threshold-Based Membership Inference Attack
"""

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Load dataset
digits = load_digits()
X, y = digits.data, digits.target

# Split: 50% train (members), 50% test (non-members)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

# Train target model
target_model = MLPClassifier(
    hidden_layer_sizes=(128, 64),
    max_iter=500,
    random_state=42
)
target_model.fit(X_train, y_train)

def threshold_attack(model, X, y_true, threshold=0.9):
    """
    Membership inference using confidence threshold.
    
    Intuition: Training samples typically have higher prediction confidence
    
    Args:
        model: Target classifier
        X: Data samples to test
        y_true: True labels
        threshold: Confidence threshold for membership decision
    
    Returns:
        membership_predictions: 1 if predicted member, 0 otherwise
        confidences: model's confidence in the true class of each sample
    """
    # Get prediction probabilities
    probs = model.predict_proba(X)
    
    # Get confidence for the correct class
    confidences = np.array([
        probs[i, y_true[i]] for i in range(len(y_true))
    ])
    
    # Predict membership based on threshold
    membership_predictions = (confidences >= threshold).astype(int)
    
    return membership_predictions, confidences

# Attack training data (should predict "member" = 1)
train_preds, train_conf = threshold_attack(
    target_model, X_train, y_train, threshold=0.8
)

# Attack test data (should predict "non-member" = 0)  
test_preds, test_conf = threshold_attack(
    target_model, X_test, y_test, threshold=0.8
)

# Ground truth: train=1 (member), test=0 (non-member)
y_attack_true = np.concatenate([
    np.ones(len(X_train)), 
    np.zeros(len(X_test))
])
y_attack_pred = np.concatenate([train_preds, test_preds])

# Evaluate attack
print("=== Membership Inference Attack Results ===")
print(f"Attack Accuracy: {accuracy_score(y_attack_true, y_attack_pred):.4f}")
print(f"Attack Precision: {precision_score(y_attack_true, y_attack_pred):.4f}")
print(f"Attack Recall: {recall_score(y_attack_true, y_attack_pred):.4f}")
print(f"\nTraining data avg confidence: {train_conf.mean():.4f}")
print(f"Test data avg confidence: {test_conf.mean():.4f}")

Expected Output:

=== Membership Inference Attack Results ===
Attack Accuracy: 0.6824
Attack Precision: 0.6532
Attack Recall: 0.7891
Training data avg confidence: 0.9234
Test data avg confidence: 0.8156
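The 0.8 threshold above is arbitrary. A threshold-free evaluation treats the true-class confidence itself as a membership score and reports ROC AUC. A sketch under the same digits setup (`true_class_confidence` is our own helper):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=42)

model = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500, random_state=42)
model.fit(X_tr, y_tr)

def true_class_confidence(model, X, y):
    """Probability the model assigns to each sample's true class."""
    return model.predict_proba(X)[np.arange(len(y)), y]

scores = np.concatenate([true_class_confidence(model, X_tr, y_tr),
                         true_class_confidence(model, X_te, y_te)])
labels = np.concatenate([np.ones(len(X_tr)), np.zeros(len(X_te))])

# AUC = probability a random member outscores a random non-member
auc = roc_auc_score(labels, scores)
print(f"Membership inference AUC: {auc:.4f}")
```

An AUC above 0.5 means the confidence signal leaks membership regardless of which threshold an attacker picks.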

2.5 Shadow Model Attack (Shokri et al., 2017)

The shadow model attack is more sophisticated and doesn't require prior knowledge of the optimal threshold.

Attack Pipeline:

┌──────────────────────────────────────────────────────────────────────┐
│                    SHADOW MODEL ATTACK PIPELINE                       │
├──────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  STEP 1: Create Shadow Training Data                                  │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  Similar distribution to target's training data              │    │
│  │  Split into: D_in (members) and D_out (non-members)         │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              │                                        │
│                              ▼                                        │
│  STEP 2: Train Shadow Models                                          │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  Train k shadow models on different D_in splits              │    │
│  │  These mimic the target model's behavior                     │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              │                                        │
│                              ▼                                        │
│  STEP 3: Generate Attack Training Data                                │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  Query shadow models with D_in (label=1) and D_out (label=0)│    │
│  │  Features: prediction vector from shadow model               │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              │                                        │
│                              ▼                                        │
│  STEP 4: Train Attack Model                                           │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  Binary classifier: prediction vector → member/non-member    │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              │                                        │
│                              ▼                                        │
│  STEP 5: Attack Target Model                                          │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  Query target model → Get prediction vector                  │    │
│  │  Feed to attack model → Get membership prediction            │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                       │
└──────────────────────────────────────────────────────────────────────┘

2.6 Implementation: Shadow Model Attack

"""
Demo: Shadow Model Membership Inference Attack
Complete implementation following Shokri et al. (2017)
"""

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

class ShadowModelAttack:
    """
    Membership Inference Attack using Shadow Models
    """
    
    def __init__(self, n_shadow_models=3, target_model_class=MLPClassifier):
        self.n_shadow_models = n_shadow_models
        self.target_model_class = target_model_class
        self.shadow_models = []
        self.attack_model = None
        
    def _train_shadow_model(self, X_train, y_train):
        """Train a single shadow model"""
        model = self.target_model_class(
            hidden_layer_sizes=(64, 32),
            max_iter=300,
            random_state=np.random.randint(10000)
        )
        model.fit(X_train, y_train)
        return model
    
    def prepare_attack_data(self, X_shadow, y_shadow):
        """
        Train shadow models and generate attack training data
        
        Returns:
            X_attack: Feature vectors (prediction probabilities)
            y_attack: Labels (1=member, 0=non-member)
        """
        X_attack_list = []
        y_attack_list = []
        
        for i in range(self.n_shadow_models):
            # Random split for each shadow model
            X_in, X_out, y_in, y_out = train_test_split(
                X_shadow, y_shadow, 
                test_size=0.5, 
                random_state=i
            )
            
            # Train shadow model
            shadow_model = self._train_shadow_model(X_in, y_in)
            self.shadow_models.append(shadow_model)
            
            # Get predictions for members (in) and non-members (out)
            pred_in = shadow_model.predict_proba(X_in)
            pred_out = shadow_model.predict_proba(X_out)
            
            # Create attack features: concatenate prediction probs with true label
            # This helps the attack model learn class-specific patterns
            attack_features_in = np.column_stack([
                pred_in, 
                y_in
            ])
            attack_features_out = np.column_stack([
                pred_out, 
                y_out
            ])
            
            X_attack_list.extend(attack_features_in)
            X_attack_list.extend(attack_features_out)
            
            # Labels: 1 for members, 0 for non-members
            y_attack_list.extend([1] * len(X_in))
            y_attack_list.extend([0] * len(X_out))
        
        return np.array(X_attack_list), np.array(y_attack_list)
    
    def train_attack_model(self, X_attack, y_attack):
        """Train the attack classifier"""
        self.attack_model = LogisticRegression(max_iter=1000)
        self.attack_model.fit(X_attack, y_attack)
        
        # Evaluate on attack training data
        train_acc = accuracy_score(y_attack, self.attack_model.predict(X_attack))
        print(f"Attack model training accuracy: {train_acc:.4f}")
        
    def attack(self, target_model, X_query, y_query):
        """
        Perform membership inference attack
        
        Args:
            target_model: The model being attacked
            X_query: Samples to query
            y_query: True labels of query samples
            
        Returns:
            membership_preds: 1 if predicted member, 0 otherwise
            membership_probs: attack model's estimated membership probability
        """
        # Get target model predictions
        target_preds = target_model.predict_proba(X_query)
        
        # Create attack features
        attack_features = np.column_stack([target_preds, y_query])
        
        # Predict membership
        membership_preds = self.attack_model.predict(attack_features)
        membership_probs = self.attack_model.predict_proba(attack_features)[:, 1]
        
        return membership_preds, membership_probs


# ===== DEMONSTRATION =====

# Generate dataset
print("Generating synthetic dataset...")
X, y = make_classification(
    n_samples=5000,
    n_features=20,
    n_informative=15,
    n_classes=5,
    n_clusters_per_class=2,
    random_state=42
)

# Split into target and shadow data (simulating different data sources)
X_target, X_shadow, y_target, y_shadow = train_test_split(
    X, y, test_size=0.5, random_state=42
)

# Further split target data into train (members) and test (non-members)
X_target_train, X_target_test, y_target_train, y_target_test = train_test_split(
    X_target, y_target, test_size=0.5, random_state=42
)

# Train target model
print("\nTraining target model...")
target_model = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    max_iter=300,
    random_state=42
)
target_model.fit(X_target_train, y_target_train)
print(f"Target model test accuracy: {target_model.score(X_target_test, y_target_test):.4f}")

# Initialize and prepare shadow model attack
print("\nPreparing shadow model attack...")
attack = ShadowModelAttack(n_shadow_models=5)

# Prepare attack training data
X_attack, y_attack = attack.prepare_attack_data(X_shadow, y_shadow)
print(f"Attack training data size: {len(X_attack)}")

# Train attack model
print("\nTraining attack model...")
attack.train_attack_model(X_attack, y_attack)

# Perform attack on target model's training data (members)
print("\n=== Attacking Training Data (Members) ===")
member_preds, member_probs = attack.attack(
    target_model, X_target_train, y_target_train
)
member_accuracy = accuracy_score(
    np.ones(len(X_target_train)), 
    member_preds
)
print(f"True Positive Rate (correctly identified members): {member_accuracy:.4f}")

# Perform attack on target model's test data (non-members)
print("\n=== Attacking Test Data (Non-members) ===")
nonmember_preds, nonmember_probs = attack.attack(
    target_model, X_target_test, y_target_test
)
nonmember_accuracy = accuracy_score(
    np.zeros(len(X_target_test)), 
    nonmember_preds
)
print(f"True Negative Rate (correctly identified non-members): {nonmember_accuracy:.4f}")

# Overall attack accuracy
all_true = np.concatenate([
    np.ones(len(X_target_train)), 
    np.zeros(len(X_target_test))
])
all_pred = np.concatenate([member_preds, nonmember_preds])

print("\n=== Overall Attack Performance ===")
print(classification_report(all_true, all_pred, 
                           target_names=['Non-member', 'Member']))

2.7 Factors Affecting Attack Success

Factor | Impact on Attack Success | Explanation
-------|--------------------------|------------
Model overfitting | ↑ Higher success | Overfitted models memorize training data more
Training set size | ↓ Lower success | Larger datasets = less memorization per sample
Model complexity | ↑ Higher success | Complex models have more capacity to memorize
Number of classes | ↑ Higher success | More classes = more distinguishing information
Data uniqueness | ↑ Higher success | Unique/outlier samples are more memorable
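The training-set-size factor is easy to check empirically. The sketch below (a rough experiment, not a rigorous benchmark) trains the same architecture on increasing amounts of data and measures the train/held-out confidence gap that the attacks in this section exploit:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=20,
                           n_informative=10, random_state=42)
X_pool, X_out, y_pool, y_out = train_test_split(X, y, test_size=0.3, random_state=42)

gaps = []
for n in [100, 500, 2000]:
    model = MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=1000, random_state=42)
    model.fit(X_pool[:n], y_pool[:n])
    # Confidence gap between members (first n pool samples) and held-out non-members
    gap = (np.max(model.predict_proba(X_pool[:n]), axis=1).mean()
           - np.max(model.predict_proba(X_out), axis=1).mean())
    gaps.append(gap)
    print(f"train size {n:4d} | confidence gap: {gap:.4f}")
```

Larger training sets generally shrink the gap, which is why the threshold and shadow attacks above degrade as datasets grow.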

2.8 Class Exercise: Analyze Attack Results

Task: Modify the shadow model attack to answer these questions:

  1. How does the number of shadow models affect attack accuracy?
  2. How does the target model's training set size affect vulnerability?
  3. Which samples are most vulnerable to membership inference?

# Exercise template
def analyze_vulnerability_factors():
    """
    TODO: Experiment with different configurations
    
    1. Vary n_shadow_models: [1, 3, 5, 10]
    2. Vary target training size: [100, 500, 1000, 2000]
    3. Identify most vulnerable samples (highest membership probability)
    """
    pass

Section 3: Model Inversion and Extraction Attacks (30 minutes)

3.1 Model Inversion Attacks

Definition: Model inversion attacks attempt to reconstruct sensitive features of training data by exploiting the model's learned representations.

Key Insight: The model encodes information about training data in its parameters and predictions. We can "reverse" this encoding to recover input features.

┌─────────────────────────────────────────────────────────────────┐
│                    MODEL INVERSION ATTACK                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   FORWARD: Training Data → Model → Predictions                   │
│                                                                  │
│   ┌──────────┐         ┌───────┐         ┌────────────┐        │
│   │  Face    │  ─────► │ Model │  ─────► │ "Alice"    │        │
│   │  Image   │         │       │         │ (Label)    │        │
│   └──────────┘         └───────┘         └────────────┘        │
│                                                                  │
│   INVERSION: Label → Optimization → Reconstructed Data           │
│                                                                  │
│   ┌────────────┐       ┌───────┐         ┌──────────┐          │
│   │  "Alice"   │ ─────►│ Model │◄─────── │ ???      │          │
│   │  (Target)  │       │       │  grad   │ (Random) │          │
│   └────────────┘       └───────┘         └──────────┘          │
│                             │                   │                │
│                             └───────────────────┘                │
│                           Optimize to maximize                   │
│                           P("Alice" | reconstructed)             │
│                                                                  │
│   RESULT: Reconstructed approximation of Alice's face            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

3.2 Types of Model Inversion

Type | Access Required | Attack Goal
-----|-----------------|------------
Confidence-based | Black-box (probabilities) | Reconstruct average class representation
Gradient-based | White-box (gradients) | Reconstruct specific training examples
Generative | Black-box + auxiliary data | Generate realistic reconstructions
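The demo in Section 3.3 implements the confidence-based row via random hill climbing. For intuition on the gradient-based row, the same objective can instead be climbed along a gradient. Since sklearn does not expose input gradients, the sketch below estimates them by finite differences (a true white-box attack would read them from the model's backward pass; `invert_by_gradient` is our own helper):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

model = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=42)
model.fit(X_scaled, y)

def invert_by_gradient(model, target_class, n_features, steps=200, lr=0.5, eps=1e-3):
    """Gradient ascent on P(target_class | x); gradient estimated numerically."""
    x = np.zeros((1, n_features))
    for _ in range(steps):
        base = model.predict_proba(x)[0, target_class]
        grad = np.zeros(n_features)
        for j in range(n_features):
            x_eps = x.copy()
            x_eps[0, j] += eps
            grad[j] = (model.predict_proba(x_eps)[0, target_class] - base) / eps
        x = x + lr * grad
    return x, model.predict_proba(x)[0, target_class]

x_star, conf = invert_by_gradient(model, target_class=0, n_features=4)
print(f"Reconstructed (scaled) features: {x_star.round(2)}, confidence: {conf:.4f}")
```

Gradient-guided search converges far faster than random perturbation, which is why white-box access makes inversion substantially more dangerous.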

3.3 Confidence-Based Model Inversion (Fredrikson et al., 2015)

"""
Demo: Basic Model Inversion Attack
Reconstructing class-representative features from a classifier
"""

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Load and prepare data
iris = load_iris()
X, y = iris.data, iris.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train target model
model = MLPClassifier(
    hidden_layer_sizes=(32, 16),
    max_iter=500,
    random_state=42
)
model.fit(X_scaled, y)

def model_inversion_attack(model, target_class, n_features, 
                           n_iterations=1000, learning_rate=0.1):
    """
    Gradient-free model inversion using confidence scores.
    
    Objective: Find x* that maximizes P(target_class | x)
    
    Uses simple hill climbing with random perturbations.
    """
    # Initialize with random features
    x_reconstructed = np.random.randn(1, n_features)
    best_confidence = 0
    best_x = x_reconstructed.copy()
    
    for i in range(n_iterations):
        # Get current confidence for target class
        probs = model.predict_proba(x_reconstructed)[0]
        current_confidence = probs[target_class]
        
        if current_confidence > best_confidence:
            best_confidence = current_confidence
            best_x = x_reconstructed.copy()
        
        # Random perturbation
        perturbation = np.random.randn(1, n_features) * learning_rate
        
        # Try the perturbation
        x_new = x_reconstructed + perturbation
        new_probs = model.predict_proba(x_new)[0]
        new_confidence = new_probs[target_class]
        
        # Accept if confidence improves
        if new_confidence > current_confidence:
            x_reconstructed = x_new
        
        # Decrease learning rate over time
        if i % 200 == 0 and i > 0:
            learning_rate *= 0.8
    
    return best_x, best_confidence

# Perform model inversion for each class
print("=== Model Inversion Attack Results ===\n")
print("Feature names:", iris.feature_names)
print()

for target_class in range(3):
    class_name = iris.target_names[target_class]
    
    # Attack
    reconstructed, confidence = model_inversion_attack(
        model, target_class, n_features=4, n_iterations=2000
    )
    
    # Convert back to original scale for comparison
    reconstructed_original = scaler.inverse_transform(reconstructed)[0]
    
    # Get actual mean of training class
    actual_mean = X[y == target_class].mean(axis=0)
    
    print(f"Class: {class_name}")
    print(f"  Reconstruction confidence: {confidence:.4f}")
    print(f"  Reconstructed features: {reconstructed_original.round(2)}")
    print(f"  Actual class mean:      {actual_mean.round(2)}")
    print(f"  Feature-wise error:     {np.abs(reconstructed_original - actual_mean).round(2)}")
    print()

Expected Output:

=== Model Inversion Attack Results ===

Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Class: setosa
  Reconstruction confidence: 0.9987
  Reconstructed features: [5.02 3.41 1.48 0.26]
  Actual class mean:      [5.01 3.43 1.46 0.25]
  Feature-wise error:     [0.01 0.02 0.02 0.01]

Class: versicolor
  Reconstruction confidence: 0.9823
  Reconstructed features: [5.94 2.78 4.31 1.35]
  Actual class mean:      [5.94 2.77 4.26 1.33]
  Feature-wise error:     [0.   0.01 0.05 0.02]

3.4 Model Extraction Attacks

Definition: Model extraction (or model stealing) attacks aim to create a functionally equivalent copy of a target model through query access.

Motivation for Attackers:

  • Bypass API costs
  • Prepare for white-box attacks
  • Steal intellectual property
  • Violate licensing agreements

3.5 Model Extraction Strategies

┌─────────────────────────────────────────────────────────────────────┐
│                    MODEL EXTRACTION STRATEGIES                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. EQUATION SOLVING (for simple models)                            │
│     ├── Query with carefully chosen inputs                          │
│     ├── Solve system of equations for parameters                    │
│     └── Works for: Linear models, decision trees                    │
│                                                                      │
│  2. KNOWLEDGE DISTILLATION                                          │
│     ├── Query target model with synthetic data                      │
│     ├── Use (input, prediction) pairs as training data              │
│     ├── Train surrogate model to mimic target                       │
│     └── Works for: Any differentiable model                         │
│                                                                      │
│  3. ACTIVE LEARNING                                                  │
│     ├── Strategically select queries near decision boundaries       │
│     ├── Maximize information gain per query                         │
│     └── More efficient than random sampling                         │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
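Strategy 1 can be made concrete for a linear model exposed through a prediction API: querying the origin recovers the bias, and querying each standard basis vector recovers one weight exactly, so d+1 queries steal all d+1 parameters. A minimal sketch (the "API" here is a local function standing in for the remote model; `api_predict` and the parameter names are our own):

```python
import numpy as np

# Hidden "proprietary" linear model: f(x) = w·x + b (attacker can only query it)
rng = np.random.default_rng(0)
w_true, b_true = rng.normal(size=5), 1.5

def api_predict(x):
    """Black-box query interface to the target model."""
    return float(np.dot(w_true, x) + b_true)

# Equation solving: d+1 queries recover all d+1 parameters
d = 5
b_stolen = api_predict(np.zeros(d))                        # f(0) = b
w_stolen = np.array([api_predict(np.eye(d)[j]) - b_stolen  # f(e_j) - b = w_j
                     for j in range(d)])

print("max weight error:", np.max(np.abs(w_stolen - w_true)))
```

For non-linear models no such closed-form recovery exists, which is why the distillation approach demonstrated next falls back on approximating the target's behavior.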

3.6 Implementation: Model Extraction via Distillation

"""
Demo: Model Extraction Attack via Knowledge Distillation
"""

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Create target model (simulating a proprietary model)
print("=== Setting up Target Model (Proprietary) ===")
X_private, y_private = make_moons(n_samples=1000, noise=0.1, random_state=42)

target_model = MLPClassifier(
    hidden_layer_sizes=(100, 50, 25),
    max_iter=500,
    random_state=42
)
target_model.fit(X_private, y_private)
print(f"Target model architecture: MLP with layers (100, 50, 25)")
print(f"Target model private data size: {len(X_private)}")

def extract_model(target_model, n_queries, feature_range, 
                  surrogate_class=RandomForestClassifier):
    """
    Extract a model through query access.
    
    Strategy: Knowledge distillation with synthetic queries
    
    Args:
        target_model: The model to steal
        n_queries: Number of queries to use
        feature_range: (min, max) for generating synthetic data
        surrogate_class: Type of model to train as surrogate
    
    Returns:
        surrogate_model: Extracted model
        X_synthetic: The synthetic queries used
        y_synthetic: Labels obtained by querying the target model
    """
    # Generate synthetic query data
    # Using uniform random sampling in the feature space
    n_features = 2
    min_val, max_val = feature_range
    
    X_synthetic = np.random.uniform(
        min_val, max_val, 
        size=(n_queries, n_features)
    )
    
    # Query target model to get labels
    y_synthetic = target_model.predict(X_synthetic)
    
    # Soft labels (class probabilities) would enable even higher-fidelity
    # extraction; this demo trains the surrogate on hard labels only
    
    # Train surrogate model on synthetic labeled data
    surrogate_model = surrogate_class(n_estimators=100, random_state=42)
    surrogate_model.fit(X_synthetic, y_synthetic)
    
    return surrogate_model, X_synthetic, y_synthetic

# Perform extraction with different query budgets
query_budgets = [50, 100, 500, 1000, 5000]
results = []

print("\n=== Model Extraction Attack ===")
print(f"Target model accuracy on private data: {target_model.score(X_private, y_private):.4f}")
print()

for n_queries in query_budgets:
    surrogate, X_syn, y_syn = extract_model(
        target_model, 
        n_queries=n_queries,
        feature_range=(-2, 3)
    )
    
    # Evaluate how well surrogate mimics target
    # Test on the private data (attacker doesn't have this, but we use for evaluation)
    target_preds = target_model.predict(X_private)
    surrogate_preds = surrogate.predict(X_private)
    
    # Fidelity: how often surrogate agrees with target
    fidelity = accuracy_score(target_preds, surrogate_preds)
    
    # Accuracy: how well surrogate performs on true task
    accuracy = accuracy_score(y_private, surrogate_preds)
    
    results.append({
        'queries': n_queries,
        'fidelity': fidelity,
        'accuracy': accuracy
    })
    
    print(f"Queries: {n_queries:5d} | Fidelity: {fidelity:.4f} | Accuracy: {accuracy:.4f}")

print("\n=== Key Insights ===")
print("• Fidelity measures how well the stolen model mimics the target")
print("• With enough queries, we can extract a high-fidelity copy")
print("• The extracted model can then be used for white-box attacks")

Expected Output:

=== Setting up Target Model (Proprietary) ===
Target model architecture: MLP with layers (100, 50, 25)
Target model private data size: 1000

=== Model Extraction Attack ===
Target model accuracy on private data: 0.9950

Queries:    50 | Fidelity: 0.8340 | Accuracy: 0.8290
Queries:   100 | Fidelity: 0.8910 | Accuracy: 0.8850
Queries:   500 | Fidelity: 0.9580 | Accuracy: 0.9530
Queries:  1000 | Fidelity: 0.9780 | Accuracy: 0.9750
Queries:  5000 | Fidelity: 0.9920 | Accuracy: 0.9890

=== Key Insights ===
• Fidelity measures how well the stolen model mimics the target
• With enough queries, we can extract a high-fidelity copy
• The extracted model can then be used for white-box attacks

3.7 Defenses Against Inversion and Extraction

| Defense | Mechanism | Trade-offs |
|---------|-----------|------------|
| Prediction Perturbation | Add noise to confidence scores | Reduces utility |
| Confidence Masking | Only return top-k classes | Limits functionality |
| Query Rate Limiting | Restrict queries per user | Affects legitimate users |
| Watermarking | Embed identifiable patterns | Requires verification mechanism |
| Differential Privacy | Mathematically bounded leakage | Reduces accuracy |
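Confidence masking is simple enough to sketch directly. The helper below (a hypothetical wrapper, not a standard API) keeps only the top-k class scores and rounds them coarsely before returning, starving extraction attacks of the soft-label signal they exploit:

```python
import numpy as np

def mask_predictions(probs, top_k=1, decimals=1):
    """Return only the top-k class scores, coarsely rounded; zero the rest.
    This reduces the per-query information available to an attacker."""
    masked = np.zeros_like(probs)
    for i, row in enumerate(probs):
        top = np.argsort(row)[-top_k:]            # indices of the k largest scores
        masked[i, top] = np.round(row[top], decimals)
    return masked

probs = np.array([[0.07, 0.81, 0.12],
                  [0.55, 0.40, 0.05]])
print(mask_predictions(probs, top_k=1, decimals=1))
```

The trade-off in the table shows up immediately: legitimate users who need calibrated probabilities (e.g. for risk scoring) lose them too.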

Section 4: Differential Privacy Fundamentals (30 minutes)

4.1 What is Differential Privacy?

Definition: Differential privacy is a mathematical framework that provides provable privacy guarantees by ensuring that any single individual's data has a limited impact on the output of a computation.

Intuition: An algorithm is differentially private if its output doesn't change much whether or not any single individual's data is included.

┌─────────────────────────────────────────────────────────────────────┐
│                    DIFFERENTIAL PRIVACY INTUITION                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   Database D  = {Alice, Bob, Carol, Dave, Eve}                      │
│   Database D' = {Alice, Bob, Carol, Dave}  (Eve removed)            │
│                                                                      │
│                                                                      │
│   ┌─────────────┐                         ┌─────────────┐           │
│   │  D (with    │──► Mechanism M(D) ──►  │  Output ~   │           │
│   │   Eve)      │                         │  similar    │           │
│   └─────────────┘                         └─────────────┘           │
│                                                   ≈                  │
│   ┌─────────────┐                         ┌─────────────┐           │
│   │  D' (without│──► Mechanism M(D')──►  │  Output ~   │           │
│   │   Eve)      │                         │  similar    │           │
│   └─────────────┘                         └─────────────┘           │
│                                                                      │
│   If outputs are similar, Eve's privacy is protected!               │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

4.2 Formal Definition

ε-Differential Privacy:

A randomized mechanism M provides ε-differential privacy if for all datasets D and D' differing in at most one element, and for all possible outputs S:

P[M(D) ∈ S] ≤ e^ε × P[M(D') ∈ S]

Understanding ε (epsilon):

  • ε = 0: Perfect privacy (the output carries no information about the data)
  • ε → ∞: No privacy (the output may depend arbitrarily on the data)
  • Typical deployed values: ε ∈ [0.1, 10]

| ε Value | Privacy Level | Use Case |
|---------|---------------|----------|
| 0.1 | Very high | Sensitive medical data |
| 1.0 | Standard | General analytics |
| 5.0 | Low | Public statistics |
| 10+ | Minimal | Non-sensitive data |
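The definition can be checked numerically for the Laplace mechanism introduced next: for neighboring counts that differ by the sensitivity, the ratio of output densities never exceeds e^ε. A small check (the counts 370 vs. 371 are chosen purely for illustration):

```python
import numpy as np

eps = 1.0
sensitivity = 1.0                  # a count changes by at most 1 per person
scale = sensitivity / eps

def laplace_pdf(z, mu):
    """Density of the Laplace distribution centered at mu with this scale."""
    return np.exp(-np.abs(z - mu) / scale) / (2 * scale)

# Neighboring databases: adding one person moves the true count 370 -> 371
z = np.linspace(350.0, 390.0, 4001)
ratio = laplace_pdf(z, 371) / laplace_pdf(z, 370)
print(ratio.max())                 # stays below e^eps ≈ 2.718
```

The maximum ratio equals e^ε exactly, attained wherever the output is at least 371, which is precisely the ε-DP bound P[M(D) ∈ S] ≤ e^ε · P[M(D') ∈ S] holding with equality.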

4.3 The Laplace Mechanism

The most common method to achieve differential privacy:

For a function f: D → R with sensitivity Δf:

M(D) = f(D) + Laplace(Δf / ε)

Where:

  • Sensitivity (Δf): Maximum change in f when one record changes
  • Laplace(b): Noise drawn from Laplace distribution with scale b

4.4 Implementation: Differential Privacy Basics

"""
Demo: Implementing Differential Privacy from Scratch
"""

import numpy as np
import matplotlib.pyplot as plt

class LaplaceMechanism:
    """
    Implements the Laplace mechanism for differential privacy.
    """
    
    def __init__(self, epsilon):
        """
        Args:
            epsilon: Privacy budget (smaller = more private)
        """
        self.epsilon = epsilon
    
    def add_noise(self, true_value, sensitivity):
        """
        Add Laplace noise to achieve ε-differential privacy.
        
        Args:
            true_value: The actual computation result
            sensitivity: Maximum change when one record changes
            
        Returns:
            Noisy value that satisfies ε-DP
        """
        scale = sensitivity / self.epsilon
        noise = np.random.laplace(0, scale)
        return true_value + noise
    
    def private_count(self, data, predicate):
        """
        Count elements satisfying predicate with DP.
        Sensitivity of count = 1 (one person can change count by at most 1)
        """
        true_count = sum(predicate(x) for x in data)
        return self.add_noise(true_count, sensitivity=1)
    
    def private_mean(self, data, data_range):
        """
        Compute mean with DP.
        Sensitivity of mean = range / n
        """
        n = len(data)
        true_mean = np.mean(data)
        sensitivity = data_range / n
        return self.add_noise(true_mean, sensitivity)
    
    def private_sum(self, data, max_contribution):
        """
        Compute sum with DP.
        Sensitivity = max contribution per individual
        """
        # Clip individual contributions
        clipped_data = np.clip(data, 0, max_contribution)
        true_sum = np.sum(clipped_data)
        return self.add_noise(true_sum, sensitivity=max_contribution)


# ===== DEMONSTRATION =====

# Create synthetic salary dataset
np.random.seed(42)
n_employees = 1000
salaries = np.random.normal(75000, 15000, n_employees)
salaries = np.clip(salaries, 30000, 200000)  # Realistic range

print("=== Differential Privacy Demo: Salary Statistics ===\n")
print(f"Dataset size: {n_employees} employees")
print(f"True mean salary: ${np.mean(salaries):,.2f}")
print(f"True total payroll: ${np.sum(salaries):,.2f}")
print()

# Test with different privacy budgets
epsilons = [0.1, 0.5, 1.0, 5.0, 10.0]

print("Private Mean Salary (multiple runs to show noise variance):")
print("-" * 60)

for epsilon in epsilons:
    dp = LaplaceMechanism(epsilon)
    
    # Run multiple times to show variance
    private_means = []
    for _ in range(5):
        private_mean = dp.private_mean(
            salaries, 
            data_range=200000-30000  # max - min salary
        )
        private_means.append(private_mean)
    
    avg_private = np.mean(private_means)
    std_private = np.std(private_means)
    error = abs(avg_private - np.mean(salaries))
    
    print(f"ε={epsilon:4.1f} | Private means: ${avg_private:,.0f} "
          f"(±${std_private:,.0f}) | Error: ${error:,.0f}")

print()
print("Private Count: Employees earning > $80,000")
print("-" * 60)
true_count = sum(salaries > 80000)
print(f"True count: {true_count}")

for epsilon in [0.5, 1.0, 5.0]:
    dp = LaplaceMechanism(epsilon)
    private_counts = [
        dp.private_count(salaries, lambda x: x > 80000)
        for _ in range(5)
    ]
    print(f"ε={epsilon:.1f} | Private counts: {[int(c) for c in private_counts]}")

Expected Output:

=== Differential Privacy Demo: Salary Statistics ===

Dataset size: 1000 employees
True mean salary: $74,892.35
True total payroll: $74,892,347.23

Private Mean Salary (multiple runs to show noise variance):
------------------------------------------------------------
ε= 0.1 | Private means: $76,543 (±$1,892) | Error: $1,651
ε= 0.5 | Private means: $75,012 (±$342) | Error: $120
ε= 1.0 | Private means: $74,923 (±$178) | Error: $31
ε= 5.0 | Private means: $74,889 (±$35) | Error: $3
ε=10.0 | Private means: $74,891 (±$17) | Error: $1

Private Count: Employees earning > $80,000
------------------------------------------------------------
True count: 371
ε=0.5 | Private counts: [373, 369, 375, 367, 372]
ε=1.0 | Private counts: [372, 370, 371, 372, 370]
ε=5.0 | Private counts: [371, 371, 371, 371, 371]

4.5 Differential Privacy in Machine Learning (DP-SGD)

DP-SGD (Differentially Private Stochastic Gradient Descent) modifies the training process to provide privacy guarantees:

┌─────────────────────────────────────────────────────────────────────┐
│                         DP-SGD ALGORITHM                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   Standard SGD:                                                      │
│   θ_{t+1} = θ_t - η · (1/B) Σᵢ ∇L(θ_t, xᵢ)                         │
│                                                                      │
│   DP-SGD adds two steps:                                            │
│                                                                      │
│   1. GRADIENT CLIPPING (bound sensitivity)                          │
│      g̃ᵢ = gᵢ / max(1, ||gᵢ||₂ / C)                                 │
│                                                                      │
│   2. NOISE ADDITION (add calibrated noise)                          │
│      θ_{t+1} = θ_t - η · (1/B) [Σᵢ g̃ᵢ + N(0, σ²C²I)]              │
│                                                                      │
│   Where:                                                             │
│   • C = clipping threshold                                          │
│   • σ = noise multiplier (determined by privacy budget)             │
│   • B = batch size                                                  │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

4.6 Implementation: Simple DP-SGD

"""
Demo: Simplified DP-SGD Implementation
"""

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

class SimpleNeuralNetwork:
    """Simple 2-layer neural network for demonstration"""
    
    def __init__(self, input_dim, hidden_dim, output_dim):
        # Initialize weights
        self.W1 = np.random.randn(input_dim, hidden_dim) * 0.1
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(hidden_dim, output_dim) * 0.1
        self.b2 = np.zeros(output_dim)
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
    
    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2
    
    def compute_gradients(self, X, y):
        """Compute gradients for a single sample"""
        m = X.shape[0]
        
        # Forward pass
        output = self.forward(X)
        
        # Backward pass
        dz2 = output - y.reshape(-1, 1)
        dW2 = self.a1.T @ dz2 / m
        db2 = np.mean(dz2, axis=0)
        
        da1 = dz2 @ self.W2.T
        dz1 = da1 * self.a1 * (1 - self.a1)
        dW1 = X.T @ dz1 / m
        db1 = np.mean(dz1, axis=0)
        
        return {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}
    
    def predict(self, X):
        return (self.forward(X) > 0.5).astype(int).flatten()


class DPSGDTrainer:
    """DP-SGD Trainer with gradient clipping and noise addition"""
    
    def __init__(self, model, clip_norm=1.0, noise_multiplier=1.0, 
                 learning_rate=0.1):
        self.model = model
        self.clip_norm = clip_norm
        self.noise_multiplier = noise_multiplier
        self.lr = learning_rate
    
    def clip_gradient(self, grad_dict):
        """Clip gradient to have maximum L2 norm of clip_norm"""
        # Compute total gradient norm
        total_norm = 0
        for key in grad_dict:
            total_norm += np.sum(grad_dict[key] ** 2)
        total_norm = np.sqrt(total_norm)
        
        # Clip if necessary
        clip_factor = min(1.0, self.clip_norm / (total_norm + 1e-6))
        
        clipped = {}
        for key in grad_dict:
            clipped[key] = grad_dict[key] * clip_factor
        
        return clipped
    
    def add_noise(self, grad_dict):
        """Add Gaussian noise calibrated to the clipped sensitivity.

        Simplification: noise N(0, (σC)²) is added to the *averaged*
        gradient here; standard DP-SGD adds it to the per-batch sum,
        which corresponds to an effective std of σC/B after averaging.
        """
        noisy = {}
        for key in grad_dict:
            noise = np.random.normal(
                0, 
                self.noise_multiplier * self.clip_norm,
                grad_dict[key].shape
            )
            noisy[key] = grad_dict[key] + noise
        return noisy
    
    def train_step(self, X_batch, y_batch, private=True):
        """Perform one training step"""
        # Compute per-sample gradients and clip
        batch_grads = {'W1': [], 'b1': [], 'W2': [], 'b2': []}
        
        for i in range(len(X_batch)):
            # Compute gradient for single sample
            grad = self.model.compute_gradients(
                X_batch[i:i+1], 
                y_batch[i:i+1]
            )
            
            if private:
                # Clip individual gradient
                grad = self.clip_gradient(grad)
            
            for key in grad:
                batch_grads[key].append(grad[key])
        
        # Average gradients
        avg_grads = {}
        for key in batch_grads:
            avg_grads[key] = np.mean(batch_grads[key], axis=0)
        
        if private:
            # Add noise
            avg_grads = self.add_noise(avg_grads)
        
        # Update model
        self.model.W1 -= self.lr * avg_grads['W1']
        self.model.b1 -= self.lr * avg_grads['b1']
        self.model.W2 -= self.lr * avg_grads['W2']
        self.model.b2 -= self.lr * avg_grads['b2']
    
    def train(self, X, y, epochs=10, batch_size=32, private=True):
        """Train the model"""
        n = len(X)
        
        for epoch in range(epochs):
            # Shuffle data
            indices = np.random.permutation(n)
            X_shuffled = X[indices]
            y_shuffled = y[indices]
            
            # Mini-batch training
            for i in range(0, n, batch_size):
                X_batch = X_shuffled[i:i+batch_size]
                y_batch = y_shuffled[i:i+batch_size]
                self.train_step(X_batch, y_batch, private=private)


# ===== DEMONSTRATION =====

# Generate dataset
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=10,
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("=== DP-SGD vs Standard SGD Comparison ===\n")

# Train non-private model
print("Training non-private model...")
model_standard = SimpleNeuralNetwork(20, 32, 1)
trainer_standard = DPSGDTrainer(
    model_standard, 
    learning_rate=0.5
)
trainer_standard.train(X_train, y_train, epochs=50, private=False)
acc_standard = accuracy_score(y_test, model_standard.predict(X_test))
print(f"Non-private model accuracy: {acc_standard:.4f}")

# Train with different privacy levels
noise_levels = [0.1, 0.5, 1.0, 2.0, 5.0]

print("\nTraining DP models with different noise levels:")
print("-" * 50)

for noise in noise_levels:
    model_dp = SimpleNeuralNetwork(20, 32, 1)
    trainer_dp = DPSGDTrainer(
        model_dp,
        clip_norm=1.0,
        noise_multiplier=noise,
        learning_rate=0.5
    )
    trainer_dp.train(X_train, y_train, epochs=50, private=True)
    acc_dp = accuracy_score(y_test, model_dp.predict(X_test))
    
    privacy_level = "High" if noise > 1.0 else "Medium" if noise > 0.3 else "Low"
    print(f"Noise σ={noise:.1f} ({privacy_level:6s} privacy) | Accuracy: {acc_dp:.4f}")

print("\nKey insight: Higher noise = more privacy, but lower accuracy")
print("This is the fundamental privacy-utility trade-off!")

4.7 Privacy Budget Composition

Theorem (Basic Composition): If M₁ is ε₁-DP and M₂ is ε₂-DP, then releasing both is (ε₁ + ε₂)-DP.

Implication: Privacy "budget" depletes with each query!

┌────────────────────────────────────────────────────────────┐
│                PRIVACY BUDGET TRACKING                      │
├────────────────────────────────────────────────────────────┤
│                                                             │
│  Total Budget: ε_total = 10                                │
│                                                             │
│  Query 1: Mean salary      → ε = 2.0  │ Remaining: 8.0    │
│  Query 2: Count by dept    → ε = 1.5  │ Remaining: 6.5    │
│  Query 3: Median age       → ε = 2.0  │ Remaining: 4.5    │
│  Query 4: Distribution     → ε = 3.0  │ Remaining: 1.5    │
│  Query 5: ??? [BLOCKED]    → ε = 2.0  │ Insufficient!     │
│                                                             │
└────────────────────────────────────────────────────────────┘
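The budget tracking in the box can be sketched as a small accountant under basic composition. This is a hypothetical helper for illustration; production systems use tighter accountants (e.g. advanced composition or moments accounting).

```python
class PrivacyAccountant:
    """Tracks cumulative ε spent under basic (sequential) composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        """Deduct ε for one query; refuse if the budget would be exceeded."""
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError(
                f"budget exhausted: only {self.total - self.spent:.1f} left")
        self.spent += epsilon
        return self.total - self.spent   # remaining budget

acct = PrivacyAccountant(10.0)
for cost in [2.0, 1.5, 2.0, 3.0]:        # the four queries from the box above
    print(f"charged {cost} -> {acct.charge(cost):.1f} remaining")
# A fifth query costing 2.0 would now raise: only 1.5 remains
```

This mirrors the box exactly: after the fourth query only 1.5 of the budget remains, so a fifth ε = 2.0 query must be blocked.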

Section 5: Federated Learning Privacy Considerations (20 minutes)

5.1 What is Federated Learning?

Definition: Federated Learning (FL) is a distributed machine learning approach where the model is trained across multiple decentralized devices or servers holding local data, without exchanging the raw data.

┌─────────────────────────────────────────────────────────────────────┐
│                    FEDERATED LEARNING OVERVIEW                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│                        ┌─────────────────┐                          │
│                        │  Central Server │                          │
│                        │   (Aggregator)  │                          │
│                        └────────┬────────┘                          │
│                     ┌───────────┼───────────┐                       │
│                     │           │           │                       │
│               ┌─────▼─────┐ ┌───▼───┐ ┌─────▼─────┐                │
│               │ Client 1  │ │Client 2│ │ Client 3  │                │
│               │(Hospital A)│ │(Bank B)│ │(Phone User)│               │
│               │           │ │       │ │           │                │
│               │ Local     │ │ Local │ │ Local     │                │
│               │ Data D₁   │ │Data D₂│ │ Data D₃   │                │
│               └───────────┘ └───────┘ └───────────┘                │
│                                                                      │
│   Protocol:                                                          │
│   1. Server sends global model to clients                           │
│   2. Clients train locally on their data                            │
│   3. Clients send model updates (gradients) to server               │
│   4. Server aggregates updates → new global model                   │
│   5. Repeat until convergence                                       │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

5.2 Privacy Promise vs. Reality

The Promise:

  • "Data never leaves the device"
  • "Only model updates are shared"
  • Privacy through decentralization

The Reality:

  • Model updates (gradients) leak information!
  • Multiple attack vectors exist
  • Privacy guarantees require additional measures

5.3 Privacy Attacks in Federated Learning

┌─────────────────────────────────────────────────────────────────────┐
│              ATTACKS ON FEDERATED LEARNING                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. GRADIENT LEAKAGE ATTACKS                                        │
│     └─► Reconstruct training data from shared gradients             │
│         • Deep Leakage from Gradients (DLG)                         │
│         • Inverting Gradients (iDLG)                                │
│                                                                      │
│  2. MEMBERSHIP INFERENCE                                            │
│     └─► Determine if specific data was used by a client             │
│         • Analyze gradient patterns                                  │
│         • Observe model behavior changes                            │
│                                                                      │
│  3. MODEL POISONING                                                 │
│     └─► Malicious clients corrupt the global model                  │
│         • Backdoor attacks via gradient manipulation                │
│                                                                      │
│  4. INFERENCE FROM AGGREGATES                                       │
│     └─► Even aggregated updates leak information                    │
│         • Especially with few clients                               │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

5.4 Gradient Leakage Attack Demo

"""
Demo: Gradient Leakage in Federated Learning
Shows how gradients can reveal training data
"""

import numpy as np

def demonstrate_gradient_leakage():
    """
    Simplified demonstration of gradient leakage.
    
    In a linear model y = Wx, the gradient w.r.t. W is:
    ∇W = (Wx - y) * x^T
    
    If we know W and ∇W, we can potentially recover x!
    """
    
    print("=== Gradient Leakage Demonstration ===\n")
    
    # Scenario: Simple linear regression
    # True data point (this is private!)
    x_private = np.array([[3.0, 5.0, 2.0]])  # 1x3 input
    y_private = np.array([[7.0]])  # 1x1 output
    
    print("Private training data (attacker should not know this):")
    print(f"  x = {x_private[0]}")
    print(f"  y = {y_private[0]}")
    print()
    
    # Model weights (public after training round)
    W = np.array([[0.5, 1.2, 0.8]])  # 1x3 weights
    
    # Client computes gradient and shares it
    prediction = x_private @ W.T  # Forward pass
    error = prediction - y_private
    gradient = error.T @ x_private  # ∇W = error * x
    
    print("Shared gradient (what server receives):")
    print(f"  ∇W = {gradient[0]}")
    print()
    
    # ATTACK: Reconstruct x from the shared gradient.
    # For a single sample, ∇W = error * x with scalar error = Wx - y,
    # so the gradient is just a scaled copy of the private input and
    # dividing by the error recovers x exactly.
    # (Here we read off the true error for clarity; a real attacker
    # would estimate it, e.g. by optimizing a guess until its gradient
    # matches the shared one.)
    error_scalar = error[0, 0]
    x_reconstructed = gradient / error_scalar
    
    print("Attacker's reconstruction:")
    print(f"  Reconstructed x = {x_reconstructed[0]}")
    print(f"  True x          = {x_private[0]}")
    print(f"  Reconstruction error: {np.linalg.norm(x_reconstructed - x_private):.6f}")
    print()
    print("⚠️  Private data successfully recovered from gradient!")

demonstrate_gradient_leakage()

print("\n" + "="*60)
print("More sophisticated attacks (Deep Leakage from Gradients)")
print("="*60)
print("""
Research has shown that for deep neural networks:

1. Gradients contain enough information to reconstruct inputs
2. Both images and text can be recovered
3. Batch gradients leak individual samples

The attack optimizes:
   x* = argmin ||∇W(x*) - ∇W_shared||²

Starting from random noise, the attacker iteratively refines
their guess until its gradient matches the shared gradient.

This works surprisingly well for:
• Image classification models
• Language models  
• Even with batch sizes > 1

Defense: Differential Privacy on gradients (DP-FL)
""")

5.5 Privacy-Preserving Federated Learning

| Technique | Description | Trade-offs |
|-----------|-------------|------------|
| Secure Aggregation | Cryptographic protocols hide individual updates | Computational overhead |
| DP-FL | Add noise to gradients before sharing | Reduced model accuracy |
| Client Selection | Randomly sample participating clients | Slower convergence |
| Gradient Compression | Reduce information in updates | May leak less; reduces utility |
| Homomorphic Encryption | Compute on encrypted gradients | Very high computational cost |
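The secure-aggregation row rests on one core trick, pairwise additive masking: each pair of clients agrees on a random mask that one adds and the other subtracts, so every individual update the server sees looks random, yet the masks cancel in the sum. A minimal sketch (real protocols additionally need key agreement and dropout handling):

```python
import numpy as np

rng = np.random.default_rng(0)
updates = [rng.normal(size=4) for _ in range(3)]   # each client's model update

# One shared random mask per client pair (i, j) with i < j
n = len(updates)
masks = {(i, j): rng.normal(size=4) for i in range(n) for j in range(i + 1, n)}

masked = []
for i in range(n):
    m = updates[i].copy()
    for (a, b), mask in masks.items():
        if a == i:
            m += mask        # the lower-indexed client of the pair adds the mask
        elif b == i:
            m -= mask        # its partner subtracts the very same mask
    masked.append(m)

# Individual masked updates reveal nothing useful, but the sum is exact
print(np.allclose(sum(masked), sum(updates)))      # True
```

The server thus learns only the aggregate, which is exactly what FedAvg needs; combined with DP noise this gives defense in depth.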

5.6 Implementation: Simulated Federated Learning with DP

"""
Demo: Federated Learning with Differential Privacy
"""

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

class FederatedLearningSimulator:
    """
    Simulates federated learning with optional differential privacy.
    """
    
    def __init__(self, n_clients=5, epsilon=None):
        self.n_clients = n_clients
        self.epsilon = epsilon  # None means no DP
        self.global_model = None
    
    def partition_data(self, X, y):
        """Partition data among clients (IID)"""
        n = len(X)
        indices = np.random.permutation(n)
        splits = np.array_split(indices, self.n_clients)
        
        client_data = []
        for split in splits:
            client_data.append((X[split], y[split]))
        
        return client_data
    
    def clip_and_noise(self, gradients, clip_norm=1.0):
        """Apply DP to gradients"""
        if self.epsilon is None:
            return gradients
        
        # Clip
        norm = np.linalg.norm(gradients)
        if norm > clip_norm:
            gradients = gradients * (clip_norm / norm)
        
        # Add noise
        noise_scale = clip_norm / self.epsilon
        noise = np.random.laplace(0, noise_scale, gradients.shape)
        
        return gradients + noise
    
    def train_round(self, client_data, n_local_epochs=1):
        """One round of federated training"""
        client_updates = []
        
        for client_id, (X_client, y_client) in enumerate(client_data):
            # Initialize client model with global weights
            client_model = SGDClassifier(
                loss='log_loss',
                max_iter=n_local_epochs,
                warm_start=True,
                random_state=42
            )
            
            # Copy global model if exists
            if self.global_model is not None:
                client_model.coef_ = self.global_model.coef_.copy()
                client_model.intercept_ = self.global_model.intercept_.copy()
                client_model.classes_ = self.global_model.classes_
            
            # Local training
            client_model.fit(X_client, y_client)
            
            # Compute update (difference from global)
            if self.global_model is not None:
                coef_update = client_model.coef_ - self.global_model.coef_
                intercept_update = client_model.intercept_ - self.global_model.intercept_
            else:
                coef_update = client_model.coef_
                intercept_update = client_model.intercept_
            
            # Apply DP if enabled
            coef_update = self.clip_and_noise(coef_update)
            intercept_update = self.clip_and_noise(intercept_update)
            
            client_updates.append((coef_update, intercept_update, client_model))
        
        # Aggregate updates (FedAvg)
        avg_coef_update = np.mean([u[0] for u in client_updates], axis=0)
        avg_intercept_update = np.mean([u[1] for u in client_updates], axis=0)
        
        # Update global model
        if self.global_model is None:
            # First round: the client "updates" are full weights,
            # so the averaged update IS the new global model
            self.global_model = client_updates[0][2]
            self.global_model.coef_ = avg_coef_update
            self.global_model.intercept_ = avg_intercept_update
        else:
            self.global_model.coef_ += avg_coef_update
            self.global_model.intercept_ += avg_intercept_update
    
    def train(self, X, y, n_rounds=10, n_local_epochs=1):
        """Full federated training"""
        client_data = self.partition_data(X, y)
        
        for round_num in range(n_rounds):
            self.train_round(client_data, n_local_epochs)
    
    def evaluate(self, X_test, y_test):
        """Evaluate global model"""
        return accuracy_score(y_test, self.global_model.predict(X_test))


# ===== DEMONSTRATION =====

# Generate dataset
X, y = make_classification(
    n_samples=5000,
    n_features=20,
    n_informative=15,
    random_state=42
)

# Split into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("=== Federated Learning with Differential Privacy ===\n")

# Train without DP
print("Non-private Federated Learning:")
fl_standard = FederatedLearningSimulator(n_clients=5, epsilon=None)
fl_standard.train(X_train, y_train, n_rounds=20)
acc_standard = fl_standard.evaluate(X_test, y_test)
print(f"  Accuracy: {acc_standard:.4f}")

# Train with different privacy levels
print("\nPrivate Federated Learning (DP-FL):")
print("-" * 45)

epsilons = [10.0, 5.0, 1.0, 0.5, 0.1]
for epsilon in epsilons:
    fl_private = FederatedLearningSimulator(n_clients=5, epsilon=epsilon)
    fl_private.train(X_train, y_train, n_rounds=20)
    acc_private = fl_private.evaluate(X_test, y_test)
    
    privacy_level = "Low" if epsilon > 5 else "Med" if epsilon > 0.5 else "High"
    print(f"  ε={epsilon:5.1f} ({privacy_level:4s} privacy) | Accuracy: {acc_private:.4f}")

print("\n" + "="*50)
print("Privacy-Utility Trade-off Summary")
print("="*50)
print("• Lower ε = Stronger privacy guarantee")
print("• Lower ε = More noise = Lower accuracy")
print("• Real deployments typically use ε between 1-10")
print("• Combine with secure aggregation for defense in depth")

5.7 Summary: Federated Learning Privacy

Key Takeaways:

  1. Federated Learning is NOT inherently private
    • Gradients leak training data information
    • Sophisticated attacks can reconstruct inputs
  2. Defense requires explicit privacy mechanisms
    • Differential privacy on gradients
    • Secure aggregation protocols
    • Combination of techniques
  3. Privacy-Utility Trade-off is fundamental
    • Stronger privacy = Lower accuracy
    • Must balance based on application requirements
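Takeaway 2's "differential privacy on gradients" fits in a few lines. This is an illustrative fragment, not the simulator's actual implementation: `clip_and_noise` is our hypothetical helper, and the clip-then-noise recipe follows DP-SGD (Abadi et al., 2016):

```python
# Illustrative DP gradient protection: clip, then add calibrated noise.
import numpy as np

def clip_and_noise(grad, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Bound a client update's L2 norm, then add Gaussian noise.

    Clipping caps each client's influence (sensitivity) at clip_norm;
    the noise standard deviation is noise_multiplier * clip_norm.
    """
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(grad)
    clipped = grad / max(1.0, norm / clip_norm)  # now ||clipped||_2 <= clip_norm
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grad.shape)
    return clipped + noise

g = np.array([3.0, 4.0])   # L2 norm 5 -> rescaled to norm 1 before noising
print(clip_and_noise(g, clip_norm=1.0, noise_multiplier=0.5))
```

Higher `noise_multiplier` means a smaller effective ε per round: the same privacy-utility dial the experiment above sweeps over.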

Section 6: Wrap-up and Q&A (10 minutes)

6.1 Key Concepts Summary

┌─────────────────────────────────────────────────────────────────────┐
│                    WEEK 5 KEY TAKEAWAYS                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. ML MODELS LEAK TRAINING DATA INFORMATION                        │
│     • Models memorize patterns from training data                   │
│     • This creates exploitable privacy vulnerabilities              │
│                                                                      │
│  2. MEMBERSHIP INFERENCE: Can determine training set membership     │
│     • Exploits confidence gap between training/test data            │
│     • Shadow model attacks are particularly effective               │
│                                                                      │
│  3. MODEL INVERSION: Can reconstruct training data features         │
│     • Optimization-based attacks find data that maximizes output    │
│     • Especially dangerous for facial recognition systems           │
│                                                                      │
│  4. MODEL EXTRACTION: Can steal model functionality                 │
│     • Query access sufficient to create functional copies           │
│     • Enables follow-up white-box attacks                           │
│                                                                      │
│  5. DIFFERENTIAL PRIVACY: Provable privacy guarantees               │
│     • Mathematical framework limiting individual impact             │
│     • Key parameters: ε (privacy budget), sensitivity              │
│                                                                      │
│  6. FEDERATED LEARNING: Distributed but not automatically private   │
│     • Gradients leak information                                    │
│     • Requires DP-FL or secure aggregation for privacy              │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

6.2 Looking Ahead: Week 6 Preview

Next week we'll explore LLM Architecture & Attack Surfaces, including:

  • Transformer architecture security considerations
  • Training data risks in LLMs
  • Attack surface analysis for large language models

6.3 Homework Assignment

Assignment 5: Privacy Attack Implementation

  1. Membership Inference (40 points)
    • Implement both threshold and shadow model attacks
    • Compare attack success across different model architectures
    • Analyze which factors increase vulnerability
  2. Differential Privacy (30 points)
    • Implement the Laplace mechanism for a real dataset
    • Experiment with different ε values
    • Plot the privacy-utility trade-off curve
  3. Critical Analysis (30 points)
    • Read: "Membership Inference Attacks Against Machine Learning Models" (Shokri et al., 2017)
    • Write a 2-page analysis of the attack methodology and defenses

Due: Before Week 7 class

6.4 Additional Resources

Papers:

  • Shokri et al. (2017). "Membership Inference Attacks Against Machine Learning Models"
  • Fredrikson et al. (2015). "Model Inversion Attacks that Exploit Confidence Information"
  • Abadi et al. (2016). "Deep Learning with Differential Privacy"
  • Zhu et al. (2019). "Deep Leakage from Gradients"

Tools:

Online Courses:

  • Coursera: "Privacy in Machine Learning"
  • Udacity: "Secure and Private AI"

Appendix: Code Templates

A.1 Quick Reference: Membership Inference Attack

# Minimal membership inference attack template
# (assumes a scikit-learn-style classifier with integer class labels 0..k-1)
def membership_inference(model, x, y_true, threshold=0.85):
    """
    Return True if x is predicted to be a training-set member,
    i.e., the model's confidence in the true label meets the threshold.
    """
    prob = model.predict_proba([x])[0]   # class-probability vector for x
    confidence = prob[y_true]            # confidence assigned to the true class
    return confidence >= threshold
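A hypothetical end-to-end check of the A.1 template: the model choice, dataset sizes, and threshold are ours, picked so the train/test confidence gap is large (fully grown random forests memorize their training set, exactly the overfitting this attack exploits):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def membership_inference(model, x, y_true, threshold=0.85):
    # Restated from A.1 so this snippet runs standalone
    prob = model.predict_proba([x])[0]
    return prob[y_true] >= threshold

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Fully grown trees memorize training points, widening the confidence gap
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

in_rate = np.mean([membership_inference(model, x, yt)
                   for x, yt in zip(X_tr[:100], y_tr[:100])])
out_rate = np.mean([membership_inference(model, x, yt)
                    for x, yt in zip(X_te[:100], y_te[:100])])
print(f"Flagged as member: train {in_rate:.2f} vs. held-out {out_rate:.2f}")
```

Training points are flagged far more often than held-out ones; the gap between the two rates is the attacker's signal.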

A.2 Quick Reference: Differential Privacy

# Minimal Laplace mechanism
import numpy as np

def dp_query(true_value, sensitivity, epsilon):
    """
    Return an ε-differentially private version of a numeric query result.
    Noise scale b = sensitivity / epsilon.
    """
    noise = np.random.laplace(0, sensitivity / epsilon)
    return true_value + noise
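A quick illustration of the A.2 template on a counting query (sensitivity 1, since adding or removing one record changes a count by at most 1; the count of 1234 is made up for illustration):

```python
import numpy as np

def dp_query(true_value, sensitivity, epsilon):
    # Restated from A.2 so this snippet runs standalone
    noise = np.random.laplace(0, sensitivity / epsilon)
    return true_value + noise

np.random.seed(0)
true_count = 1234   # e.g., records matching some sensitive condition
for eps in (0.1, 1.0, 10.0):
    noisy = dp_query(true_count, sensitivity=1, epsilon=eps)
    print(f"ε={eps:>4}: noisy count = {noisy:.1f}")
```

Since the noise scale is sensitivity/ε, shrinking ε by 10× inflates the expected error by 10×: the same privacy-utility trade-off the homework asks you to plot.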

End of Week 5 Tutorial

Questions? Office Hours: Tuesday/Thursday, 1:00 PM - 3:30 PM via Zoom