Module: Adversarial Machine Learning
Course: CSCI 5773 - Introduction to Emerging Systems Security
Duration: 140-150 minutes
Instructor: Dr. Zhengxiong Li
By the end of this session, students will be able to:
- Understand privacy risks in ML systems: identify and explain the fundamental privacy threats that emerge when deploying machine learning models
- Implement membership inference attacks: design and execute attacks that determine whether specific data points were used in training
- Apply differential privacy techniques: implement privacy-preserving mechanisms to protect training data while maintaining model utility
| Section | Topic | Duration |
|---|---|---|
| 1 | Privacy Threats in Machine Learning | 25 min |
| 2 | Membership Inference Attacks | 35 min |
| 3 | Model Inversion and Extraction Attacks | 30 min |
| 4 | Differential Privacy Fundamentals | 30 min |
| 5 | Federated Learning Privacy Considerations | 20 min |
| 6 | Wrap-up and Q&A | 10 min |
Machine learning models are not just mathematical functions—they are compressed representations of their training data. This fundamental property creates significant privacy risks that many practitioners overlook.
The Privacy Paradox in ML:
- We want models that generalize well to new data
- But models inevitably memorize aspects of their training data
- This memorization creates an information leakage channel
┌─────────────────────────────────────────────────────────────────────┐
│ ML PRIVACY THREAT TAXONOMY │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ TRAINING TIME │ │ INFERENCE TIME │ │ MODEL-LEVEL │ │
│ ├─────────────────┤ ├─────────────────┤ ├─────────────────┤ │
│ │ • Data Poisoning│ │ • Membership │ │ • Model │ │
│ │ • Label Leakage │ │ Inference │ │ Extraction │ │
│ │ • Gradient │ │ • Model │ │ • Watermark │ │
│ │ Leakage │ │ Inversion │ │ Removal │ │
│ │ • Insider │ │ • Attribute │ │ • Fine-tuning │ │
│ │ Threats │ │ Inference │ │ Attacks │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Key Concept: Unintentional Memorization
Neural networks can memorize specific training examples, especially:
- Rare or unique data points (outliers)
- Data that appears multiple times
- Data with distinctive patterns
Example: Credit Card Memorization in Language Models
Research has shown that large language models can memorize and reproduce sensitive information like credit card numbers if they appear in training data:
Training data: "My credit card number is 4532-1234-5678-9012"
↓
Model Training
↓
Query: "My credit card number is 4532-"
Model output: "1234-5678-9012" ← Information leakage!
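This failure mode is easy to reproduce without any deep learning at all. The sketch below stands in for an LLM with a character-level lookup table (a toy, not a real language model) and shows the prefix completing verbatim:

# Toy illustration of verbatim memorization: a lookup table standing in
# for an LLM's memorized n-grams (not a real language model)
training_text = "My credit card number is 4532-1234-5678-9012"

# "Train": record the character that follows each 8-character context
context_len = 8
model = {}
for i in range(len(training_text) - context_len):
    context = training_text[i:i + context_len]
    model[context] = training_text[i + context_len]

# "Attack": prompt with a prefix and greedily complete it
output = "My credit card number is 4532-"
for _ in range(14):
    next_char = model.get(output[-context_len:])
    if next_char is None:
        break
    output += next_char

print(output)  # prints the memorized card number verbatim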
| Attack Type | Attacker's Goal | Required Access | Example Scenario |
|---|---|---|---|
| Membership Inference | Determine if specific data was in training set | Black-box (predictions only) | Was my medical record used to train this diagnostic model? |
| Model Inversion | Reconstruct training data features | Black-box or white-box | Recover faces from a facial recognition model |
| Attribute Inference | Infer sensitive attributes of training data | Black-box | Infer income level from credit model |
| Model Extraction | Steal the model itself | Black-box (queries) | Clone a proprietary fraud detection model |
| Training Data Extraction | Extract verbatim training examples | Black-box | Extract memorized text from LLMs |
"""
Demo: Visualizing how models memorize training data
This example shows how model confidence differs between
training data and unseen data
"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Generate synthetic dataset
X, y = make_classification(
n_samples=1000,
n_features=20,
n_informative=10,
n_redundant=5,
random_state=42
)
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Train a model that will overfit (demonstrating memorization)
model = MLPClassifier(
hidden_layer_sizes=(100, 100, 100), # Deep network
max_iter=1000,
random_state=42
)
model.fit(X_train, y_train)
# Get prediction confidences
train_probs = model.predict_proba(X_train)
test_probs = model.predict_proba(X_test)
# Extract confidence for correct class
train_confidence = np.max(train_probs, axis=1)
test_confidence = np.max(test_probs, axis=1)
print(f"Training set - Mean confidence: {train_confidence.mean():.4f}")
print(f"Test set - Mean confidence: {test_confidence.mean():.4f}")
print(f"Confidence gap: {train_confidence.mean() - test_confidence.mean():.4f}")
# This gap is what membership inference attacks exploit!
Expected Output:
Training set - Mean confidence: 0.9847
Test set - Mean confidence: 0.8923
Confidence gap: 0.0924
Key Insight: The confidence gap between training and test data is the fundamental vulnerability that membership inference attacks exploit. Models are systematically more confident about data they've seen during training.
Case Study 1: Netflix Prize (2006-2009)
- Netflix released "anonymized" movie ratings for a recommendation competition
- Researchers de-anonymized users by cross-referencing with IMDb ratings
- Revealed sensitive information: political preferences, sexual orientation
- Lesson: Anonymization alone is insufficient for privacy
Case Study 2: Strava Heat Map (2018)
- Fitness app published aggregate exercise data as heat maps
- Military bases were revealed through running patterns
- Soldier identities could be inferred from consistent routes
- Lesson: Aggregate data can still leak individual information
Case Study 3: GPT-2/GPT-3 Training Data Extraction (2020-2021)
- Researchers extracted memorized content from language models
- Recovered personal information, code snippets, URLs
- Lesson: Large models can memorize and leak training data
Definition: A membership inference attack (MIA) determines whether a specific data record was used in the training dataset of a machine learning model.
┌─────────────────────────────────────────────────────────────┐
│ MEMBERSHIP INFERENCE ATTACK │
├─────────────────────────────────────────────────────────────┤
│ │
│ Attacker has: │
│ • Access to target model (black-box or white-box) │
│ • A specific data record x │
│ • (Optional) Similar data distribution │
│ │
│ Attacker wants to know: │
│ • Was x ∈ Training Set? │
│ │
│ ┌─────────┐ Query(x) ┌─────────────┐ │
│ │ │ ─────────────────► │ Target │ │
│ │Attacker │ │ Model │ │
│ │ │ ◄───────────────── │ │ │
│ └─────────┘ Prediction(x) └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ Attack Model determines: │ │
│ │ x ∈ Training Set OR x ∉ Training Set │ │
│ └─────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
The Overfitting Hypothesis:
Machine learning models behave differently on their training data compared to unseen data:
- Higher confidence: Models output higher prediction probabilities for training samples
- Lower loss: Training samples have lower loss values
- Different gradients: Gradient norms differ between members and non-members
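Each of these signals can be measured directly; the per-sample loss gap is the easiest to see. A minimal, self-contained sketch (the helper name is ours; the classifier is deliberately overparameterized so it overfits):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Overparameterized model -> memorization
clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=800, random_state=0)
clf.fit(X_tr, y_tr)

def per_sample_loss(model, X, y):
    """Cross-entropy loss of the true class, one value per sample."""
    probs = model.predict_proba(X)
    p_true = probs[np.arange(len(y)), y]
    return -np.log(np.clip(p_true, 1e-12, 1.0))

print(f"Member mean loss:     {per_sample_loss(clf, X_tr, y_tr).mean():.4f}")
print(f"Non-member mean loss: {per_sample_loss(clf, X_te, y_te).mean():.4f}")
# Members show markedly lower loss -- a directly usable membership signal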
| Attack Type | Description | Complexity |
|---|---|---|
| Threshold Attack | Simple confidence threshold | Low |
| Shadow Model Attack | Train attack model on shadow models | Medium |
| Label-Only Attack | Uses only predicted labels | Medium |
| Likelihood Ratio Attack | Statistical hypothesis testing | High |
The simplest membership inference uses a confidence threshold:
"""
Demo: Simple Threshold-Based Membership Inference Attack
"""
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
# Load dataset
digits = load_digits()
X, y = digits.data, digits.target
# Split: 50% train (members), 50% test (non-members)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.5, random_state=42
)
# Train target model
target_model = MLPClassifier(
hidden_layer_sizes=(128, 64),
max_iter=500,
random_state=42
)
target_model.fit(X_train, y_train)
def threshold_attack(model, X, y_true, threshold=0.9):
"""
Membership inference using confidence threshold.
Intuition: Training samples typically have higher prediction confidence
Args:
model: Target classifier
X: Data samples to test
y_true: True labels
threshold: Confidence threshold for membership decision
Returns:
membership_predictions: 1 if predicted member, 0 otherwise
confidences: the model's confidence in each sample's true class
"""
# Get prediction probabilities
probs = model.predict_proba(X)
# Get confidence for the correct class
confidences = np.array([
probs[i, y_true[i]] for i in range(len(y_true))
])
# Predict membership based on threshold
membership_predictions = (confidences >= threshold).astype(int)
return membership_predictions, confidences
# Attack training data (should predict "member" = 1)
train_preds, train_conf = threshold_attack(
target_model, X_train, y_train, threshold=0.8
)
# Attack test data (should predict "non-member" = 0)
test_preds, test_conf = threshold_attack(
target_model, X_test, y_test, threshold=0.8
)
# Ground truth: train=1 (member), test=0 (non-member)
y_attack_true = np.concatenate([
np.ones(len(X_train)),
np.zeros(len(X_test))
])
y_attack_pred = np.concatenate([train_preds, test_preds])
# Evaluate attack
print("=== Membership Inference Attack Results ===")
print(f"Attack Accuracy: {accuracy_score(y_attack_true, y_attack_pred):.4f}")
print(f"Attack Precision: {precision_score(y_attack_true, y_attack_pred):.4f}")
print(f"Attack Recall: {recall_score(y_attack_true, y_attack_pred):.4f}")
print(f"\nTraining data avg confidence: {train_conf.mean():.4f}")
print(f"Test data avg confidence: {test_conf.mean():.4f}")
Expected Output:
=== Membership Inference Attack Results ===
Attack Accuracy: 0.6824
Attack Precision: 0.6532
Attack Recall: 0.7891
Training data avg confidence: 0.9234
Test data avg confidence: 0.8156
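Of the four attack types in the table above, the likelihood-ratio attack is the strongest. Its scoring rule fits in a few lines: fit one Gaussian to the confidences a record receives from models trained on it (the "in" world) and one from models trained without it (the "out" world), then compare densities. A simplified sketch (real LiRA works on logit-scaled confidences; the shadow models introduced next are what supply the two samples):

import numpy as np
from scipy.stats import norm

def likelihood_ratio_score(conf_target, conf_in, conf_out):
    """Log-likelihood ratio; large positive values suggest membership."""
    mu_in, sd_in = np.mean(conf_in), np.std(conf_in) + 1e-6
    mu_out, sd_out = np.mean(conf_out), np.std(conf_out) + 1e-6
    return (norm.logpdf(conf_target, mu_in, sd_in)
            - norm.logpdf(conf_target, mu_out, sd_out))

# Toy numbers: "in" runs cluster near 0.98, "out" runs near 0.85
print(likelihood_ratio_score(0.97, [0.99, 0.98, 0.97], [0.86, 0.84, 0.88]))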
The shadow model attack is more sophisticated and doesn't require prior knowledge of the optimal threshold.
Attack Pipeline:
┌──────────────────────────────────────────────────────────────────────┐
│ SHADOW MODEL ATTACK PIPELINE │
├──────────────────────────────────────────────────────────────────────┤
│ │
│ STEP 1: Create Shadow Training Data │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Similar distribution to target's training data │ │
│ │ Split into: D_in (members) and D_out (non-members) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ STEP 2: Train Shadow Models │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Train k shadow models on different D_in splits │ │
│ │ These mimic the target model's behavior │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ STEP 3: Generate Attack Training Data │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Query shadow models with D_in (label=1) and D_out (label=0)│ │
│ │ Features: prediction vector from shadow model │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ STEP 4: Train Attack Model │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Binary classifier: prediction vector → member/non-member │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ STEP 5: Attack Target Model │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Query target model → Get prediction vector │ │
│ │ Feed to attack model → Get membership prediction │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────┘
"""
Demo: Shadow Model Membership Inference Attack
Complete implementation following Shokri et al. (2017)
"""
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
class ShadowModelAttack:
"""
Membership Inference Attack using Shadow Models
"""
def __init__(self, n_shadow_models=3, target_model_class=MLPClassifier):
self.n_shadow_models = n_shadow_models
self.target_model_class = target_model_class
self.shadow_models = []
self.attack_model = None
def _train_shadow_model(self, X_train, y_train):
"""Train a single shadow model"""
model = self.target_model_class(
hidden_layer_sizes=(64, 32),
max_iter=300,
random_state=np.random.randint(10000)
)
model.fit(X_train, y_train)
return model
def prepare_attack_data(self, X_shadow, y_shadow):
"""
Train shadow models and generate attack training data
Returns:
X_attack: Feature vectors (prediction probabilities)
y_attack: Labels (1=member, 0=non-member)
"""
X_attack_list = []
y_attack_list = []
for i in range(self.n_shadow_models):
# Random split for each shadow model
X_in, X_out, y_in, y_out = train_test_split(
X_shadow, y_shadow,
test_size=0.5,
random_state=i
)
# Train shadow model
shadow_model = self._train_shadow_model(X_in, y_in)
self.shadow_models.append(shadow_model)
# Get predictions for members (in) and non-members (out)
pred_in = shadow_model.predict_proba(X_in)
pred_out = shadow_model.predict_proba(X_out)
# Create attack features: concatenate prediction probs with true label
# This helps the attack model learn class-specific patterns
attack_features_in = np.column_stack([
pred_in,
y_in
])
attack_features_out = np.column_stack([
pred_out,
y_out
])
X_attack_list.extend(attack_features_in)
X_attack_list.extend(attack_features_out)
# Labels: 1 for members, 0 for non-members
y_attack_list.extend([1] * len(X_in))
y_attack_list.extend([0] * len(X_out))
return np.array(X_attack_list), np.array(y_attack_list)
def train_attack_model(self, X_attack, y_attack):
"""Train the attack classifier"""
self.attack_model = LogisticRegression(max_iter=1000)
self.attack_model.fit(X_attack, y_attack)
# Evaluate on attack training data
train_acc = accuracy_score(y_attack, self.attack_model.predict(X_attack))
print(f"Attack model training accuracy: {train_acc:.4f}")
def attack(self, target_model, X_query, y_query):
"""
Perform membership inference attack
Args:
target_model: The model being attacked
X_query: Samples to query
y_query: True labels of query samples
Returns:
membership_predictions: 1 if predicted member, 0 otherwise
membership_probs: attack model's membership probability per sample
"""
# Get target model predictions
target_preds = target_model.predict_proba(X_query)
# Create attack features
attack_features = np.column_stack([target_preds, y_query])
# Predict membership
membership_preds = self.attack_model.predict(attack_features)
membership_probs = self.attack_model.predict_proba(attack_features)[:, 1]
return membership_preds, membership_probs
# ===== DEMONSTRATION =====
# Generate dataset
print("Generating synthetic dataset...")
X, y = make_classification(
n_samples=5000,
n_features=20,
n_informative=15,
n_classes=5,
n_clusters_per_class=2,
random_state=42
)
# Split into target and shadow data (simulating different data sources)
X_target, X_shadow, y_target, y_shadow = train_test_split(
X, y, test_size=0.5, random_state=42
)
# Further split target data into train (members) and test (non-members)
X_target_train, X_target_test, y_target_train, y_target_test = train_test_split(
X_target, y_target, test_size=0.5, random_state=42
)
# Train target model
print("\nTraining target model...")
target_model = MLPClassifier(
hidden_layer_sizes=(64, 32),
max_iter=300,
random_state=42
)
target_model.fit(X_target_train, y_target_train)
print(f"Target model test accuracy: {target_model.score(X_target_test, y_target_test):.4f}")
# Initialize and prepare shadow model attack
print("\nPreparing shadow model attack...")
attack = ShadowModelAttack(n_shadow_models=5)
# Prepare attack training data
X_attack, y_attack = attack.prepare_attack_data(X_shadow, y_shadow)
print(f"Attack training data size: {len(X_attack)}")
# Train attack model
print("\nTraining attack model...")
attack.train_attack_model(X_attack, y_attack)
# Perform attack on target model's training data (members)
print("\n=== Attacking Training Data (Members) ===")
member_preds, member_probs = attack.attack(
target_model, X_target_train, y_target_train
)
member_accuracy = accuracy_score(
np.ones(len(X_target_train)),
member_preds
)
print(f"True Positive Rate (correctly identified members): {member_accuracy:.4f}")
# Perform attack on target model's test data (non-members)
print("\n=== Attacking Test Data (Non-members) ===")
nonmember_preds, nonmember_probs = attack.attack(
target_model, X_target_test, y_target_test
)
nonmember_accuracy = accuracy_score(
np.zeros(len(X_target_test)),
nonmember_preds
)
print(f"True Negative Rate (correctly identified non-members): {nonmember_accuracy:.4f}")
# Overall attack accuracy
all_true = np.concatenate([
np.ones(len(X_target_train)),
np.zeros(len(X_target_test))
])
all_pred = np.concatenate([member_preds, nonmember_preds])
print("\n=== Overall Attack Performance ===")
print(classification_report(all_true, all_pred,
target_names=['Non-member', 'Member']))
| Factor | Impact on Attack Success | Explanation |
|---|---|---|
| Model overfitting | ↑ Higher success | Overfitted models memorize training data more |
| Training set size | ↓ Lower success | Larger datasets = less memorization per sample |
| Model complexity | ↑ Higher success | Complex models have more capacity to memorize |
| Number of classes | ↑ Higher success | More classes = more distinguishing information |
| Data uniqueness | ↑ Higher success | Unique/outlier samples are more memorable |
Task: Modify the shadow model attack to answer these questions:
- How does the number of shadow models affect attack accuracy?
- How does the target model's training set size affect vulnerability?
- Which samples are most vulnerable to membership inference?
# Exercise template
def analyze_vulnerability_factors():
"""
TODO: Experiment with different configurations
1. Vary n_shadow_models: [1, 3, 5, 10]
2. Vary target training size: [100, 500, 1000, 2000]
3. Identify most vulnerable samples (highest membership probability)
"""
pass
Definition: Model inversion attacks attempt to reconstruct sensitive features of training data by exploiting the model's learned representations.
Key Insight: The model encodes information about training data in its parameters and predictions. We can "reverse" this encoding to recover input features.
┌─────────────────────────────────────────────────────────────────┐
│ MODEL INVERSION ATTACK │
├─────────────────────────────────────────────────────────────────┤
│ │
│ FORWARD: Training Data → Model → Predictions │
│ │
│ ┌──────────┐ ┌───────┐ ┌────────────┐ │
│ │ Face │ ─────► │ Model │ ─────► │ "Alice" │ │
│ │ Image │ │ │ │ (Label) │ │
│ └──────────┘ └───────┘ └────────────┘ │
│ │
│ INVERSION: Label → Optimization → Reconstructed Data │
│ │
│ ┌────────────┐ ┌───────┐ ┌──────────┐ │
│ │ "Alice" │ ─────►│ Model │◄─────── │ ??? │ │
│ │ (Target) │ │ │ grad │ (Random) │ │
│ └────────────┘ └───────┘ └──────────┘ │
│ │ │ │
│ └───────────────────┘ │
│ Optimize to maximize │
│ P("Alice" | reconstructed) │
│ │
│ RESULT: Reconstructed approximation of Alice's face │
│ │
└─────────────────────────────────────────────────────────────────┘
| Type | Access Required | Attack Goal |
|---|---|---|
| Confidence-based | Black-box (probabilities) | Reconstruct average class representation |
| Gradient-based | White-box (gradients) | Reconstruct specific training examples |
| Generative | Black-box + auxiliary data | Generate realistic reconstructions |
"""
Demo: Basic Model Inversion Attack
Reconstructing class-representative features from a classifier
"""
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
# Load and prepare data
iris = load_iris()
X, y = iris.data, iris.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train target model
model = MLPClassifier(
hidden_layer_sizes=(32, 16),
max_iter=500,
random_state=42
)
model.fit(X_scaled, y)
def model_inversion_attack(model, target_class, n_features,
n_iterations=1000, learning_rate=0.1):
"""
Gradient-free model inversion using confidence scores.
Objective: Find x* that maximizes P(target_class | x)
Uses simple hill climbing with random perturbations.
"""
# Initialize with random features
x_reconstructed = np.random.randn(1, n_features)
best_confidence = 0
best_x = x_reconstructed.copy()
for i in range(n_iterations):
# Get current confidence for target class
probs = model.predict_proba(x_reconstructed)[0]
current_confidence = probs[target_class]
if current_confidence > best_confidence:
best_confidence = current_confidence
best_x = x_reconstructed.copy()
# Random perturbation
perturbation = np.random.randn(1, n_features) * learning_rate
# Try the perturbation
x_new = x_reconstructed + perturbation
new_probs = model.predict_proba(x_new)[0]
new_confidence = new_probs[target_class]
# Accept if confidence improves
if new_confidence > current_confidence:
x_reconstructed = x_new
# Decrease learning rate over time
if i % 200 == 0 and i > 0:
learning_rate *= 0.8
return best_x, best_confidence
# Perform model inversion for each class
print("=== Model Inversion Attack Results ===\n")
print("Feature names:", iris.feature_names)
print()
for target_class in range(3):
class_name = iris.target_names[target_class]
# Attack
reconstructed, confidence = model_inversion_attack(
model, target_class, n_features=4, n_iterations=2000
)
# Convert back to original scale for comparison
reconstructed_original = scaler.inverse_transform(reconstructed)[0]
# Get actual mean of training class
actual_mean = X[y == target_class].mean(axis=0)
print(f"Class: {class_name}")
print(f" Reconstruction confidence: {confidence:.4f}")
print(f" Reconstructed features: {reconstructed_original.round(2)}")
print(f" Actual class mean: {actual_mean.round(2)}")
print(f" Feature-wise error: {np.abs(reconstructed_original - actual_mean).round(2)}")
print()
Expected Output:
=== Model Inversion Attack Results ===
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Class: setosa
Reconstruction confidence: 0.9987
Reconstructed features: [5.02 3.41 1.48 0.26]
Actual class mean: [5.01 3.43 1.46 0.25]
Feature-wise error: [0.01 0.02 0.02 0.01]
Class: versicolor
Reconstruction confidence: 0.9823
Reconstructed features: [5.94 2.78 4.31 1.35]
Actual class mean: [5.94 2.77 4.26 1.33]
Feature-wise error: [0. 0.01 0.05 0.02]
Definition: Model extraction (or model stealing) attacks aim to create a functionally equivalent copy of a target model through query access.
Motivation for Attackers:
- Bypass API costs
- Prepare for white-box attacks
- Steal intellectual property
- Violate licensing agreements
┌─────────────────────────────────────────────────────────────────────┐
│ MODEL EXTRACTION STRATEGIES │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ 1. EQUATION SOLVING (for simple models) │
│ ├── Query with carefully chosen inputs │
│ ├── Solve system of equations for parameters │
│ └── Works for: Linear models, decision trees │
│ │
│ 2. KNOWLEDGE DISTILLATION │
│ ├── Query target model with synthetic data │
│ ├── Use (input, prediction) pairs as training data │
│ ├── Train surrogate model to mimic target │
│ └── Works for: Any differentiable model │
│ │
│ 3. ACTIVE LEARNING │
│ ├── Strategically select queries near decision boundaries │
│ ├── Maximize information gain per query │
│ └── More efficient than random sampling │
│ │
└─────────────────────────────────────────────────────────────────────┘
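Strategy 1 is worth seeing concretely. For a linear scorer f(x) = w·x + b over d features, d+1 well-chosen queries determine every parameter exactly. A minimal sketch (the victim model here is hypothetical and separate from the demo below):

import numpy as np

rng = np.random.default_rng(0)
w_secret, b_secret = rng.normal(size=3), 0.7   # victim's hidden parameters

def query(x):
    """Black-box oracle: returns the victim's raw score w.x + b."""
    return w_secret @ x + b_secret

# d+1 = 4 queries: the origin reveals b, each unit vector reveals one weight
d = 3
b_stolen = query(np.zeros(d))
w_stolen = np.array([query(np.eye(d)[i]) - b_stolen for i in range(d)])

print("recovered w:", w_stolen, " true w:", w_secret)
print("recovered b:", b_stolen, " true b:", b_secret)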
"""
Demo: Model Extraction Attack via Knowledge Distillation
"""
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
# Create target model (simulating a proprietary model)
print("=== Setting up Target Model (Proprietary) ===")
X_private, y_private = make_moons(n_samples=1000, noise=0.1, random_state=42)
target_model = MLPClassifier(
hidden_layer_sizes=(100, 50, 25),
max_iter=500,
random_state=42
)
target_model.fit(X_private, y_private)
print(f"Target model architecture: MLP with layers (100, 50, 25)")
print(f"Target model private data size: {len(X_private)}")
def extract_model(target_model, n_queries, feature_range,
surrogate_class=RandomForestClassifier):
"""
Extract a model through query access.
Strategy: Knowledge distillation with synthetic queries
Args:
target_model: The model to steal
n_queries: Number of queries to use
feature_range: (min, max) for generating synthetic data
surrogate_class: Type of model to train as surrogate
Returns:
surrogate_model: Extracted model
X_synthetic, y_synthetic: the synthetic queries and the target's labels
"""
# Generate synthetic query data
# Using uniform random sampling in the feature space
n_features = 2  # two features, matching the make_moons task
min_val, max_val = feature_range
X_synthetic = np.random.uniform(
min_val, max_val,
size=(n_queries, n_features)
)
# Query target model to get labels
y_synthetic = target_model.predict(X_synthetic)
# Soft labels (predict_proba) would give an even better surrogate;
# this simple version trains on hard labels only
# Train surrogate model on synthetic labeled data
surrogate_model = surrogate_class(n_estimators=100, random_state=42)
surrogate_model.fit(X_synthetic, y_synthetic)
return surrogate_model, X_synthetic, y_synthetic
# Perform extraction with different query budgets
query_budgets = [50, 100, 500, 1000, 5000]
results = []
print("\n=== Model Extraction Attack ===")
print(f"Target model accuracy on private data: {target_model.score(X_private, y_private):.4f}")
print()
for n_queries in query_budgets:
surrogate, X_syn, y_syn = extract_model(
target_model,
n_queries=n_queries,
feature_range=(-2, 3)
)
# Evaluate how well surrogate mimics target
# Test on the private data (attacker doesn't have this, but we use for evaluation)
target_preds = target_model.predict(X_private)
surrogate_preds = surrogate.predict(X_private)
# Fidelity: how often surrogate agrees with target
fidelity = accuracy_score(target_preds, surrogate_preds)
# Accuracy: how well surrogate performs on true task
accuracy = accuracy_score(y_private, surrogate_preds)
results.append({
'queries': n_queries,
'fidelity': fidelity,
'accuracy': accuracy
})
print(f"Queries: {n_queries:5d} | Fidelity: {fidelity:.4f} | Accuracy: {accuracy:.4f}")
print("\n=== Key Insights ===")
print("• Fidelity measures how well the stolen model mimics the target")
print("• With enough queries, we can extract a high-fidelity copy")
print("• The extracted model can then be used for white-box attacks")
Expected Output:
=== Setting up Target Model (Proprietary) ===
Target model architecture: MLP with layers (100, 50, 25)
Target model private data size: 1000
=== Model Extraction Attack ===
Target model accuracy on private data: 0.9950
Queries: 50 | Fidelity: 0.8340 | Accuracy: 0.8290
Queries: 100 | Fidelity: 0.8910 | Accuracy: 0.8850
Queries: 500 | Fidelity: 0.9580 | Accuracy: 0.9530
Queries: 1000 | Fidelity: 0.9780 | Accuracy: 0.9750
Queries: 5000 | Fidelity: 0.9920 | Accuracy: 0.9890
=== Key Insights ===
• Fidelity measures how well the stolen model mimics the target
• With enough queries, we can extract a high-fidelity copy
• The extracted model can then be used for white-box attacks
| Defense | Mechanism | Trade-offs |
|---|---|---|
| Prediction Perturbation | Add noise to confidence scores | Reduces utility |
| Confidence Masking | Only return top-k classes | Limits functionality |
| Query Rate Limiting | Restrict queries per user | Affects legitimate users |
| Watermarking | Embed identifiable patterns | Requires verification mechanism |
| Differential Privacy | Mathematically bounded leakage | Reduces accuracy |
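The first two defenses in the table amount to a thin wrapper around the model's output. A sketch of both, usable with any scikit-learn-style classifier (the noise scale and k are illustrative choices, not tuned recommendations):

import numpy as np

def perturbed_probs(model, X, noise_std=0.05, rng=None):
    """Prediction perturbation: add noise to the scores, then re-normalize."""
    rng = rng or np.random.default_rng()
    probs = model.predict_proba(X)
    noisy = np.clip(probs + rng.normal(0, noise_std, probs.shape), 1e-6, None)
    return noisy / noisy.sum(axis=1, keepdims=True)

def top_k_probs(model, X, k=1):
    """Confidence masking: return only the top-k classes' probabilities."""
    probs = model.predict_proba(X)
    masked = np.zeros_like(probs)
    top = np.argsort(probs, axis=1)[:, -k:]
    rows = np.arange(len(probs))[:, None]
    masked[rows, top] = probs[rows, top]
    return masked / masked.sum(axis=1, keepdims=True)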
Definition: Differential privacy is a mathematical framework that provides provable privacy guarantees by ensuring that any single individual's data has a limited impact on the output of a computation.
Intuition: An algorithm is differentially private if its output doesn't change much whether or not any single individual's data is included.
┌─────────────────────────────────────────────────────────────────────┐
│ DIFFERENTIAL PRIVACY INTUITION │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Database D = {Alice, Bob, Carol, Dave, Eve} │
│ Database D' = {Alice, Bob, Carol, Dave} (Eve removed) │
│ │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ D (with │──► Mechanism M(D) ──► │ Output ~ │ │
│ │ Eve) │ │ similar │ │
│ └─────────────┘ └─────────────┘ │
│ ≈ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ D' (without│──► Mechanism M(D')──► │ Output ~ │ │
│ │ Eve) │ │ similar │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ If outputs are similar, Eve's privacy is protected! │
│ │
└─────────────────────────────────────────────────────────────────────┘
ε-Differential Privacy:
A randomized mechanism M provides ε-differential privacy if for all datasets D and D' differing in at most one element, and for all possible outputs S:
P[M(D) ∈ S] ≤ e^ε × P[M(D') ∈ S]
Understanding ε (epsilon):
- ε = 0: Perfect privacy (completely random output)
- ε → ∞: No privacy (deterministic output)
- Typical values: ε ∈ [0.1, 10]
| ε Value | Privacy Level | Use Case |
|---|---|---|
| 0.1 | Very high | Sensitive medical data |
| 1.0 | Standard | General analytics |
| 5.0 | Low | Public statistics |
| 10+ | Minimal | Non-sensitive data |
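To make the guarantee concrete: with ε = 1, adding or removing any one person's record can change the probability of any particular output by a factor of at most e¹ ≈ 2.72; with ε = 0.1 the factor is at most e^0.1 ≈ 1.11, so the output distribution is nearly unchanged and an observer learns almost nothing about that individual.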
The most common method to achieve differential privacy:
For a function f: D → R with sensitivity Δf:
M(D) = f(D) + Laplace(Δf / ε)
Where:
- Sensitivity (Δf): Maximum change in f when one record changes
- Laplace(b): Noise drawn from Laplace distribution with scale b
"""
Demo: Implementing Differential Privacy from Scratch
"""
import numpy as np
import matplotlib.pyplot as plt
class LaplaceMechanism:
"""
Implements the Laplace mechanism for differential privacy.
"""
def __init__(self, epsilon):
"""
Args:
epsilon: Privacy budget (smaller = more private)
"""
self.epsilon = epsilon
def add_noise(self, true_value, sensitivity):
"""
Add Laplace noise to achieve ε-differential privacy.
Args:
true_value: The actual computation result
sensitivity: Maximum change when one record changes
Returns:
Noisy value that satisfies ε-DP
"""
scale = sensitivity / self.epsilon
noise = np.random.laplace(0, scale)
return true_value + noise
def private_count(self, data, predicate):
"""
Count elements satisfying predicate with DP.
Sensitivity of count = 1 (one person can change count by at most 1)
"""
true_count = sum(predicate(x) for x in data)
return self.add_noise(true_count, sensitivity=1)
def private_mean(self, data, data_range):
"""
Compute mean with DP.
Sensitivity of mean = range / n
"""
n = len(data)
true_mean = np.mean(data)
sensitivity = data_range / n
return self.add_noise(true_mean, sensitivity)
def private_sum(self, data, max_contribution):
"""
Compute sum with DP.
Sensitivity = max contribution per individual
"""
# Clip individual contributions
clipped_data = np.clip(data, 0, max_contribution)
true_sum = np.sum(clipped_data)
return self.add_noise(true_sum, sensitivity=max_contribution)
# ===== DEMONSTRATION =====
# Create synthetic salary dataset
np.random.seed(42)
n_employees = 1000
salaries = np.random.normal(75000, 15000, n_employees)
salaries = np.clip(salaries, 30000, 200000) # Realistic range
print("=== Differential Privacy Demo: Salary Statistics ===\n")
print(f"Dataset size: {n_employees} employees")
print(f"True mean salary: ${np.mean(salaries):,.2f}")
print(f"True total payroll: ${np.sum(salaries):,.2f}")
print()
# Test with different privacy budgets
epsilons = [0.1, 0.5, 1.0, 5.0, 10.0]
print("Private Mean Salary (multiple runs to show noise variance):")
print("-" * 60)
for epsilon in epsilons:
dp = LaplaceMechanism(epsilon)
# Run multiple times to show variance
private_means = []
for _ in range(5):
private_mean = dp.private_mean(
salaries,
data_range=200000-30000 # max - min salary
)
private_means.append(private_mean)
avg_private = np.mean(private_means)
std_private = np.std(private_means)
error = abs(avg_private - np.mean(salaries))
print(f"ε={epsilon:4.1f} | Private means: ${avg_private:,.0f} "
f"(±${std_private:,.0f}) | Error: ${error:,.0f}")
print()
print("Private Count: Employees earning > $80,000")
print("-" * 60)
true_count = sum(salaries > 80000)
print(f"True count: {true_count}")
for epsilon in [0.5, 1.0, 5.0]:
dp = LaplaceMechanism(epsilon)
private_counts = [
dp.private_count(salaries, lambda x: x > 80000)
for _ in range(5)
]
print(f"ε={epsilon:.1f} | Private counts: {[int(c) for c in private_counts]}")
Expected Output:
=== Differential Privacy Demo: Salary Statistics ===
Dataset size: 1000 employees
True mean salary: $74,892.35
True total payroll: $74,892,347.23
Private Mean Salary (multiple runs to show noise variance):
------------------------------------------------------------
ε= 0.1 | Private means: $76,543 (±$1,892) | Error: $1,651
ε= 0.5 | Private means: $75,012 (±$342) | Error: $120
ε= 1.0 | Private means: $74,923 (±$178) | Error: $31
ε= 5.0 | Private means: $74,889 (±$35) | Error: $3
ε=10.0 | Private means: $74,891 (±$17) | Error: $1
Private Count: Employees earning > $80,000
------------------------------------------------------------
True count: 371
ε=0.5 | Private counts: [373, 369, 375, 367, 372]
ε=1.0 | Private counts: [372, 370, 371, 372, 370]
ε=5.0 | Private counts: [371, 371, 371, 371, 371]
DP-SGD (Differentially Private Stochastic Gradient Descent) modifies the training process to provide privacy guarantees:
┌─────────────────────────────────────────────────────────────────────┐
│ DP-SGD ALGORITHM │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Standard SGD: │
│ θ_{t+1} = θ_t - η · (1/B) Σᵢ ∇L(θ_t, xᵢ) │
│ │
│ DP-SGD adds two steps: │
│ │
│ 1. GRADIENT CLIPPING (bound sensitivity) │
│ g̃ᵢ = gᵢ / max(1, ||gᵢ||₂ / C) │
│ │
│ 2. NOISE ADDITION (add calibrated noise) │
│ θ_{t+1} = θ_t - η · [(1/B) Σᵢ g̃ᵢ + N(0, σ²C²I)] │
│ │
│ Where: │
│ • C = clipping threshold │
│ • σ = noise multiplier (determined by privacy budget) │
│ • B = batch size │
│ │
└─────────────────────────────────────────────────────────────────────┘
"""
Demo: Simplified DP-SGD Implementation
"""
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
class SimpleNeuralNetwork:
"""Simple 2-layer neural network for demonstration"""
def __init__(self, input_dim, hidden_dim, output_dim):
# Initialize weights
self.W1 = np.random.randn(input_dim, hidden_dim) * 0.1
self.b1 = np.zeros(hidden_dim)
self.W2 = np.random.randn(hidden_dim, output_dim) * 0.1
self.b2 = np.zeros(output_dim)
def sigmoid(self, x):
return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
def forward(self, X):
self.z1 = X @ self.W1 + self.b1
self.a1 = self.sigmoid(self.z1)
self.z2 = self.a1 @ self.W2 + self.b2
self.a2 = self.sigmoid(self.z2)
return self.a2
def compute_gradients(self, X, y):
"""Compute gradients for a single sample"""
m = X.shape[0]
# Forward pass
output = self.forward(X)
# Backward pass
dz2 = output - y.reshape(-1, 1)
dW2 = self.a1.T @ dz2 / m
db2 = np.mean(dz2, axis=0)
da1 = dz2 @ self.W2.T
dz1 = da1 * self.a1 * (1 - self.a1)
dW1 = X.T @ dz1 / m
db1 = np.mean(dz1, axis=0)
return {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}
def predict(self, X):
return (self.forward(X) > 0.5).astype(int).flatten()
class DPSGDTrainer:
"""DP-SGD Trainer with gradient clipping and noise addition"""
def __init__(self, model, clip_norm=1.0, noise_multiplier=1.0,
learning_rate=0.1):
self.model = model
self.clip_norm = clip_norm
self.noise_multiplier = noise_multiplier
self.lr = learning_rate
def clip_gradient(self, grad_dict):
"""Clip gradient to have maximum L2 norm of clip_norm"""
# Compute total gradient norm
total_norm = 0
for key in grad_dict:
total_norm += np.sum(grad_dict[key] ** 2)
total_norm = np.sqrt(total_norm)
# Clip if necessary
clip_factor = min(1.0, self.clip_norm / (total_norm + 1e-6))
clipped = {}
for key in grad_dict:
clipped[key] = grad_dict[key] * clip_factor
return clipped
def add_noise(self, grad_dict):
"""Add Gaussian noise calibrated to the sensitivity"""
noisy = {}
for key in grad_dict:
noise = np.random.normal(
0,
self.noise_multiplier * self.clip_norm,
grad_dict[key].shape
)
noisy[key] = grad_dict[key] + noise
return noisy
def train_step(self, X_batch, y_batch, private=True):
"""Perform one training step"""
# Compute per-sample gradients and clip
batch_grads = {'W1': [], 'b1': [], 'W2': [], 'b2': []}
for i in range(len(X_batch)):
# Compute gradient for single sample
grad = self.model.compute_gradients(
X_batch[i:i+1],
y_batch[i:i+1]
)
if private:
# Clip individual gradient
grad = self.clip_gradient(grad)
for key in grad:
batch_grads[key].append(grad[key])
# Average gradients
avg_grads = {}
for key in batch_grads:
avg_grads[key] = np.mean(batch_grads[key], axis=0)
if private:
# Add noise
avg_grads = self.add_noise(avg_grads)
# Update model
self.model.W1 -= self.lr * avg_grads['W1']
self.model.b1 -= self.lr * avg_grads['b1']
self.model.W2 -= self.lr * avg_grads['W2']
self.model.b2 -= self.lr * avg_grads['b2']
def train(self, X, y, epochs=10, batch_size=32, private=True):
"""Train the model"""
n = len(X)
for epoch in range(epochs):
# Shuffle data
indices = np.random.permutation(n)
X_shuffled = X[indices]
y_shuffled = y[indices]
# Mini-batch training
for i in range(0, n, batch_size):
X_batch = X_shuffled[i:i+batch_size]
y_batch = y_shuffled[i:i+batch_size]
self.train_step(X_batch, y_batch, private=private)
# ===== DEMONSTRATION =====
# Generate dataset
X, y = make_classification(
n_samples=2000,
n_features=20,
n_informative=10,
random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print("=== DP-SGD vs Standard SGD Comparison ===\n")
# Train non-private model
print("Training non-private model...")
model_standard = SimpleNeuralNetwork(20, 32, 1)
trainer_standard = DPSGDTrainer(
model_standard,
learning_rate=0.5
)
trainer_standard.train(X_train, y_train, epochs=50, private=False)
acc_standard = accuracy_score(y_test, model_standard.predict(X_test))
print(f"Non-private model accuracy: {acc_standard:.4f}")
# Train with different privacy levels
noise_levels = [0.1, 0.5, 1.0, 2.0, 5.0]
print("\nTraining DP models with different noise levels:")
print("-" * 50)
for noise in noise_levels:
model_dp = SimpleNeuralNetwork(20, 32, 1)
trainer_dp = DPSGDTrainer(
model_dp,
clip_norm=1.0,
noise_multiplier=noise,
learning_rate=0.5
)
trainer_dp.train(X_train, y_train, epochs=50, private=True)
acc_dp = accuracy_score(y_test, model_dp.predict(X_test))
privacy_level = "High" if noise > 1.0 else "Medium" if noise > 0.3 else "Low"
print(f"Noise σ={noise:.1f} ({privacy_level:6s} privacy) | Accuracy: {acc_dp:.4f}")
print("\nKey insight: Higher noise = more privacy, but lower accuracy")
print("This is the fundamental privacy-utility trade-off!")
Theorem (Basic Composition): If M₁ is ε₁-DP and M₂ is ε₂-DP, then releasing both is (ε₁ + ε₂)-DP.
Implication: Privacy "budget" depletes with each query!
┌────────────────────────────────────────────────────────────┐
│ PRIVACY BUDGET TRACKING │
├────────────────────────────────────────────────────────────┤
│ │
│ Total Budget: ε_total = 10 │
│ │
│ Query 1: Mean salary → ε = 2.0 │ Remaining: 8.0 │
│ Query 2: Count by dept → ε = 1.5 │ Remaining: 6.5 │
│ Query 3: Median age → ε = 2.0 │ Remaining: 4.5 │
│ Query 4: Distribution → ε = 3.0 │ Remaining: 1.5 │
│ Query 5: ??? [BLOCKED] → ε = 2.0 │ Insufficient! │
│ │
└────────────────────────────────────────────────────────────┘
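Basic composition is simple enough to enforce mechanically. A minimal budget-tracker sketch reproducing the scenario above (the class name and blocking policy are illustrative, not a standard library API):

class PrivacyAccountant:
    """Tracks cumulative ε under basic sequential composition."""
    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        """Reserve budget for a query; refuse if the budget would be exceeded."""
        if self.spent + epsilon > self.total:
            raise RuntimeError(f"Query blocked: needs ε={epsilon}, "
                               f"only {self.total - self.spent:.1f} remaining")
        self.spent += epsilon
        return self.total - self.spent

accountant = PrivacyAccountant(total_epsilon=10.0)
for name, eps in [("mean salary", 2.0), ("count by dept", 1.5),
                  ("median age", 2.0), ("distribution", 3.0),
                  ("next query", 2.0)]:
    try:
        print(f"{name}: ε={eps} | remaining: {accountant.charge(eps)}")
    except RuntimeError as err:
        print(err)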
Definition: Federated Learning (FL) is a distributed machine learning approach where the model is trained across multiple decentralized devices or servers holding local data, without exchanging the raw data.
┌─────────────────────────────────────────────────────────────────────┐
│ FEDERATED LEARNING OVERVIEW │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ │
│ │ Central Server │ │
│ │ (Aggregator) │ │
│ └────────┬────────┘ │
│ ┌───────────┼───────────┐ │
│ │ │ │ │
│ ┌─────▼─────┐ ┌───▼───┐ ┌─────▼─────┐ │
│ │ Client 1 │ │Client 2│ │ Client 3 │ │
│ │(Hospital A)│ │(Bank B)│ │(Phone User)│ │
│ │ │ │ │ │ │ │
│ │ Local │ │ Local │ │ Local │ │
│ │ Data D₁ │ │Data D₂│ │ Data D₃ │ │
│ └───────────┘ └───────┘ └───────────┘ │
│ │
│ Protocol: │
│ 1. Server sends global model to clients │
│ 2. Clients train locally on their data │
│ 3. Clients send model updates (gradients) to server │
│ 4. Server aggregates updates → new global model │
│ 5. Repeat until convergence │
│ │
└─────────────────────────────────────────────────────────────────────┘
The Promise:
- "Data never leaves the device"
- "Only model updates are shared"
- Privacy through decentralization
The Reality:
- Model updates (gradients) leak information!
- Multiple attack vectors exist
- Privacy guarantees require additional measures
┌─────────────────────────────────────────────────────────────────────┐
│ ATTACKS ON FEDERATED LEARNING │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ 1. GRADIENT LEAKAGE ATTACKS │
│ └─► Reconstruct training data from shared gradients │
│ • Deep Leakage from Gradients (DLG) │
│ • Inverting Gradients (iDLG) │
│ │
│ 2. MEMBERSHIP INFERENCE │
│ └─► Determine if specific data was used by a client │
│ • Analyze gradient patterns │
│ • Observe model behavior changes │
│ │
│ 3. MODEL POISONING │
│ └─► Malicious clients corrupt the global model │
│ • Backdoor attacks via gradient manipulation │
│ │
│ 4. INFERENCE FROM AGGREGATES │
│ └─► Even aggregated updates leak information │
│ • Especially with few clients │
│ │
└─────────────────────────────────────────────────────────────────────┘
"""
Demo: Gradient Leakage in Federated Learning
Shows how gradients can reveal training data
"""
import numpy as np
def demonstrate_gradient_leakage():
"""
Simplified demonstration of gradient leakage.
In a linear model y = Wx, the gradient w.r.t. W is:
∇W = (Wx - y) * x^T
If we know W and ∇W, we can potentially recover x!
"""
print("=== Gradient Leakage Demonstration ===\n")
# Scenario: Simple linear regression
# True data point (this is private!)
x_private = np.array([[3.0, 5.0, 2.0]]) # 1x3 input
y_private = np.array([[7.0]]) # 1x1 output
print("Private training data (attacker should not know this):")
print(f" x = {x_private[0]}")
print(f" y = {y_private[0]}")
print()
# Model weights (public after training round)
W = np.array([[0.5, 1.2, 0.8]]) # 1x3 weights
# Client computes gradient and shares it
prediction = x_private @ W.T # Forward pass
error = prediction - y_private
gradient = error.T @ x_private # ∇W = error * x
print("Shared gradient (what server receives):")
print(f" ∇W = {gradient[0]}")
print()
# ATTACK: Reconstruct x from gradient
# For single sample: ∇W = error * x, and error = Wx - y
# If attacker can guess/compute error, they can recover x
# Simplified attack: since ∇W = error * x, dividing the shared gradient
# by the scalar error recovers x exactly
error_scalar = error[0, 0]  # in practice, the attacker estimates this
x_reconstructed = gradient / error_scalar
print("Attacker's reconstruction:")
print(f" Reconstructed x = {x_reconstructed[0]}")
print(f" True x = {x_private[0]}")
print(f" Reconstruction error: {np.linalg.norm(x_reconstructed - x_private):.6f}")
print()
print("⚠️ Private data successfully recovered from gradient!")
demonstrate_gradient_leakage()
print("\n" + "="*60)
print("More sophisticated attacks (Deep Leakage from Gradients)")
print("="*60)
print("""
Research has shown that for deep neural networks:
1. Gradients contain enough information to reconstruct inputs
2. Both images and text can be recovered
3. Batch gradients leak individual samples
The attack optimizes:
x* = argmin ||∇W(x*) - ∇W_shared||²
Starting from random noise, the attacker iteratively refines
their guess until its gradient matches the shared gradient.
This works surprisingly well for:
• Image classification models
• Language models
• Even with batch sizes > 1
Defense: Differential Privacy on gradients (DP-FL)
""")
| Technique | Description | Trade-offs |
|---|---|---|
| Secure Aggregation | Cryptographic protocols hide individual updates | Computational overhead |
| DP-FL | Add noise to gradients before sharing | Reduced model accuracy |
| Client Selection | Randomly sample participating clients | Slower convergence |
| Gradient Compression | Reduce information in updates | May leak less, reduces utility |
| Homomorphic Encryption | Compute on encrypted gradients | Very high computational cost |
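Secure aggregation deserves a concrete picture. In the classic pairwise-masking construction, every pair of clients agrees on a random mask that one adds and the other subtracts, so the masks cancel in the server's sum. A minimal sketch of only the cancellation arithmetic (a real protocol derives the masks from key agreement and handles client dropouts):

import numpy as np

rng = np.random.default_rng(42)
updates = [rng.normal(size=4) for _ in range(3)]   # each client's true update
n = len(updates)

# One shared random mask per client pair (i < j)
masks = {(i, j): rng.normal(size=4) for i in range(n) for j in range(i + 1, n)}

def masked_update(i):
    """Client i adds masks shared with higher-index peers, subtracts lower."""
    m = updates[i].copy()
    for j in range(n):
        if j > i:
            m += masks[(i, j)]
        elif j < i:
            m -= masks[(j, i)]
    return m

uploads = [masked_update(i) for i in range(n)]     # all the server ever sees
print("upload 0 equals true update 0?", np.allclose(uploads[0], updates[0]))
print("sum of uploads:", sum(uploads).round(6))
print("true sum:      ", sum(updates).round(6))    # identical: masks cancel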
"""
Demo: Federated Learning with Differential Privacy
"""
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
class FederatedLearningSimulator:
"""
Simulates federated learning with optional differential privacy.
"""
def __init__(self, n_clients=5, epsilon=None):
self.n_clients = n_clients
self.epsilon = epsilon # None means no DP
self.global_model = None
def partition_data(self, X, y):
"""Partition data among clients (IID)"""
n = len(X)
indices = np.random.permutation(n)
splits = np.array_split(indices, self.n_clients)
client_data = []
for split in splits:
client_data.append((X[split], y[split]))
return client_data
def clip_and_noise(self, gradients, clip_norm=1.0):
"""Apply DP to gradients"""
if self.epsilon is None:
return gradients
# Clip
norm = np.linalg.norm(gradients)
if norm > clip_norm:
gradients = gradients * (clip_norm / norm)
# Add noise (a simplification: Laplace noise scaled to the clip bound;
# production DP-FL typically uses Gaussian noise calibrated to the L2 clip)
noise_scale = clip_norm / self.epsilon
noise = np.random.laplace(0, noise_scale, gradients.shape)
return gradients + noise
def train_round(self, client_data, n_local_epochs=1):
"""One round of federated training"""
client_updates = []
for client_id, (X_client, y_client) in enumerate(client_data):
# Initialize client model with global weights
client_model = SGDClassifier(
loss='log_loss',
max_iter=n_local_epochs,
warm_start=True,
random_state=42
)
# Copy global model if exists
if self.global_model is not None:
client_model.coef_ = self.global_model.coef_.copy()
client_model.intercept_ = self.global_model.intercept_.copy()
client_model.classes_ = self.global_model.classes_
# Local training
client_model.fit(X_client, y_client)
# Compute update (difference from global)
if self.global_model is not None:
coef_update = client_model.coef_ - self.global_model.coef_
intercept_update = client_model.intercept_ - self.global_model.intercept_
else:
coef_update = client_model.coef_
intercept_update = client_model.intercept_
# Apply DP if enabled
coef_update = self.clip_and_noise(coef_update)
intercept_update = self.clip_and_noise(intercept_update)
client_updates.append((coef_update, intercept_update, client_model))
# Aggregate updates (FedAvg)
avg_coef_update = np.mean([u[0] for u in client_updates], axis=0)
avg_intercept_update = np.mean([u[1] for u in client_updates], axis=0)
# Update global model (on the first round the "updates" are the full
# client weights, so set them directly instead of adding)
if self.global_model is None:
    self.global_model = client_updates[0][2]
    self.global_model.coef_ = avg_coef_update
    self.global_model.intercept_ = avg_intercept_update
else:
    self.global_model.coef_ += avg_coef_update
    self.global_model.intercept_ += avg_intercept_update
def train(self, X, y, n_rounds=10, n_local_epochs=1):
"""Full federated training"""
client_data = self.partition_data(X, y)
for round_num in range(n_rounds):
self.train_round(client_data, n_local_epochs)
def evaluate(self, X_test, y_test):
"""Evaluate global model"""
return accuracy_score(y_test, self.global_model.predict(X_test))
# ===== DEMONSTRATION =====
# Generate dataset
X, y = make_classification(
n_samples=5000,
n_features=20,
n_informative=15,
random_state=42
)
# Split into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print("=== Federated Learning with Differential Privacy ===\n")
# Train without DP
print("Non-private Federated Learning:")
fl_standard = FederatedLearningSimulator(n_clients=5, epsilon=None)
fl_standard.train(X_train, y_train, n_rounds=20)
acc_standard = fl_standard.evaluate(X_test, y_test)
print(f" Accuracy: {acc_standard:.4f}")
# Train with different privacy levels
print("\nPrivate Federated Learning (DP-FL):")
print("-" * 45)
epsilons = [10.0, 5.0, 1.0, 0.5, 0.1]
for epsilon in epsilons:
fl_private = FederatedLearningSimulator(n_clients=5, epsilon=epsilon)
fl_private.train(X_train, y_train, n_rounds=20)
acc_private = fl_private.evaluate(X_test, y_test)
privacy_level = "Low" if epsilon > 5 else "Med" if epsilon > 0.5 else "High"
print(f" ε={epsilon:5.1f} ({privacy_level:4s} privacy) | Accuracy: {acc_private:.4f}")
print("\n" + "="*50)
print("Privacy-Utility Trade-off Summary")
print("="*50)
print("• Lower ε = Stronger privacy guarantee")
print("• Lower ε = More noise = Lower accuracy")
print("• Real deployments typically use ε between 1-10")
print("• Combine with secure aggregation for defense in depth")
Key Takeaways:
- Federated Learning is NOT inherently private
  - Gradients leak training data information
  - Sophisticated attacks can reconstruct inputs
- Defense requires explicit privacy mechanisms
  - Differential privacy on gradients
  - Secure aggregation protocols
  - Combination of techniques
- The privacy-utility trade-off is fundamental
  - Stronger privacy = lower accuracy
  - Must balance based on application requirements
┌─────────────────────────────────────────────────────────────────────┐
│ WEEK 5 KEY TAKEAWAYS │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ 1. ML MODELS LEAK TRAINING DATA INFORMATION │
│ • Models memorize patterns from training data │
│ • This creates exploitable privacy vulnerabilities │
│ │
│ 2. MEMBERSHIP INFERENCE: Can determine training set membership │
│ • Exploits confidence gap between training/test data │
│ • Shadow model attacks are particularly effective │
│ │
│ 3. MODEL INVERSION: Can reconstruct training data features │
│ • Optimization-based attacks find data that maximizes output │
│ • Especially dangerous for facial recognition systems │
│ │
│ 4. MODEL EXTRACTION: Can steal model functionality │
│ • Query access sufficient to create functional copies │
│ • Enables follow-up white-box attacks │
│ │
│ 5. DIFFERENTIAL PRIVACY: Provable privacy guarantees │
│ • Mathematical framework limiting individual impact │
│ • Key parameters: ε (privacy budget), sensitivity │
│ │
│ 6. FEDERATED LEARNING: Distributed but not automatically private │
│ • Gradients leak information │
│ • Requires DP-FL or secure aggregation for privacy │
│ │
└─────────────────────────────────────────────────────────────────────┘
Next week we'll explore LLM Architecture & Attack Surfaces, including:
- Transformer architecture security considerations
- Training data risks in LLMs
- Attack surface analysis for large language models
Assignment 5: Privacy Attack Implementation
- Membership Inference (40 points)
  - Implement both threshold and shadow model attacks
  - Compare attack success across different model architectures
  - Analyze which factors increase vulnerability
- Differential Privacy (30 points)
  - Implement the Laplace mechanism for a real dataset
  - Experiment with different ε values
  - Plot the privacy-utility trade-off curve
- Critical Analysis (30 points)
  - Read: "Membership Inference Attacks Against Machine Learning Models" (Shokri et al., 2017)
  - Write a 2-page analysis of the attack methodology and defenses
Due: Before Week 7 class
Papers:
- Shokri et al. (2017). "Membership Inference Attacks Against Machine Learning Models"
- Fredrikson et al. (2015). "Model Inversion Attacks that Exploit Confidence Information"
- Abadi et al. (2016). "Deep Learning with Differential Privacy"
- Zhu et al. (2019). "Deep Leakage from Gradients"
Tools:
- Opacus (PyTorch library for DP-SGD training)
- TensorFlow Privacy (differentially private optimizers for TensorFlow)
Online Courses:
- Coursera: "Privacy in Machine Learning"
- Udacity: "Secure and Private AI"
# Minimal membership inference attack template
def membership_inference(model, x, y_true, threshold=0.85):
"""
Returns True if x is predicted to be in training set
"""
prob = model.predict_proba([x])[0]
confidence = prob[y_true]
return confidence >= threshold
# Minimal Laplace mechanism
import numpy as np
def dp_query(true_value, sensitivity, epsilon):
"""
Returns differentially private value
"""
noise = np.random.laplace(0, sensitivity / epsilon)
return true_value + noise
End of Week 5 Tutorial
Questions? Office Hours: Tuesday/Thursday, 1:00 PM - 3:30 PM via Zoom