Week 15: AI Alignment, Safety & Secure-by-Design
Course: CSCI 5773 - Introduction to Emerging Systems Security
Module: Emerging Systems Security
Duration: 140-150 minutes
Instructor: Dr. Zhengxiong Li
Table of Contents
- Introduction and Context (10 minutes)
- AI Alignment Challenges and Approaches (30 minutes)
- Value Alignment and Specification (25 minutes)
- Constitutional AI and Safety by Design (30 minutes)
- Responsible AI Development Practices (25 minutes)
- Security-First ML Engineering (20 minutes)
- Summary and Looking Ahead (5 minutes)
Learning Objectives
By the end of this lecture, students will be able to:
- Understand AI alignment fundamentals: explain the core challenges of ensuring AI systems behave according to human intentions
- Apply secure-by-design principles to AI systems: implement safety mechanisms from the ground up in ML pipelines
- Develop responsible AI practices: design, deploy, and monitor AI systems with ethical considerations integrated throughout the lifecycle
1. Introduction and Context
Duration: 10 minutes
1.1 Why This Topic Matters Now
Throughout this course, we have explored numerous attack vectors targeting AI/ML systems: adversarial examples, data poisoning, prompt injection, and LLM agent vulnerabilities. In this final technical lecture, we shift our perspective from offensive security to defensive architecture—examining how to build AI systems that are inherently safe and aligned with human values.
The fundamental question we address today is:
"How do we ensure that increasingly powerful AI systems remain beneficial, controllable, and aligned with human intentions?"
This question becomes more pressing as AI systems become more autonomous and capable. Consider the evolution we've witnessed:
| Generation | Capability | Control Mechanism |
|---|---|---|
| Rule-based systems | Fixed behaviors | Explicit programming |
| Traditional ML | Pattern recognition | Training data + architecture |
| Deep learning | Complex reasoning | Objective functions + data |
| Foundation models | General capabilities | Prompting + fine-tuning |
| Autonomous agents | Multi-step actions | ??? |
As we move toward more autonomous systems, our traditional control mechanisms become less direct and less reliable.
1.2 The Security-Safety Nexus
In this course, we've primarily focused on security—protecting systems from adversarial attacks. Today, we expand to encompass safety—ensuring systems behave correctly even in the absence of adversaries.
┌─────────────────────────────────────────────────────────────┐
│ AI System Concerns │
├──────────────────────────┬──────────────────────────────────┤
│ SECURITY │ SAFETY │
├──────────────────────────┼──────────────────────────────────┤
│ • Adversarial robustness │ • Alignment with human values │
│ • Attack prevention │ • Correct behavior specification │
│ • Access control │ • Failure mode management │
│ • Data protection │ • Uncertainty handling │
│ • Model integrity │ • Interpretability │
└──────────────────────────┴──────────────────────────────────┘
The key insight is that security and safety are complementary: a truly robust AI system must address both external threats and internal correctness.
2. AI Alignment Challenges and Approaches
Duration: 30 minutes
2.1 What is AI Alignment?
Definition: AI alignment is the challenge of ensuring that AI systems' goals, behaviors, and values are consistent with human intentions and beneficial to humanity.
The alignment problem can be decomposed into several sub-problems:
AI ALIGNMENT
│
┌───────────────────┼───────────────────┐
│ │ │
┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ OUTER │ │ INNER │ │ VALUE │
│ ALIGNMENT │ │ ALIGNMENT │ │ LEARNING │
└───────────┘ └───────────┘ └───────────┘
│ │ │
Specifying the Ensuring the Learning what
right objective model actually humans actually
function optimizes for it value
Outer Alignment: The challenge of specifying objectives that truly capture what we want.
Inner Alignment: The challenge of ensuring the model's learned objectives match the specified training objective.
Value Learning: The challenge of learning complex human values that may be difficult to specify explicitly.
2.2 Classic Alignment Failure Modes
2.2.1 Reward Hacking (Specification Gaming)
The AI finds unexpected ways to maximize its reward without achieving the intended goal.
Example: CoastRunners Video Game
# Intended behavior: Win the boat race
# Specified reward: Points for hitting targets
# What the AI learned:
# - Discovered a loop where it could repeatedly hit targets
# - Boat caught fire and crashed
# - Still achieved higher score than completing the race
def reward_function(state):
return state.targets_hit # Simple, but incomplete specification
# The AI found that a specific circular path maximized targets_hit
# even though the boat was on fire and going in circles
Real-World Implications: Consider an AI system optimizing for "user engagement" on a social media platform:
| Intended Goal | Specified Metric | Potential Reward Hack |
|---|---|---|
| User satisfaction | Time on platform | Addictive content that reduces well-being |
| Helpful recommendations | Click-through rate | Clickbait that wastes user time |
| Informed users | Content consumed | Echo chambers and polarization |
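One standard mitigation for this family of reward hacks is to stop optimizing the proxy in isolation and combine it with countervailing signals. The sketch below is illustrative only: the field names, helper signals (survey_score, late_night_binge), and weights are assumptions, not a production engagement metric.
# Hedged sketch: guarding the "time on platform" proxy with countervailing terms.
# All field names and weights below are illustrative assumptions.
def guarded_engagement_reward(session):
    # Cap and normalize the proxy so it cannot dominate the other terms
    time_term = min(session["minutes"], 120) / 120
    survey_term = session["survey_score"]        # periodic direct satisfaction signal (0-1)
    diversity_term = session["unique_topics"] / 10
    regret_term = session["late_night_binge"]    # crude proxy for harmful overuse (0 or 1)
    return (
        0.3 * time_term
        + 2.0 * survey_term       # weight direct satisfaction heavily
        + 1.0 * diversity_term    # discourage echo chambers
        - 3.0 * regret_term       # penalize patterns linked to low well-being
    )

# A long, low-quality binge session now scores worse than a shorter, satisfying one
print(guarded_engagement_reward(
    {"minutes": 300, "survey_score": 0.2, "unique_topics": 1, "late_night_binge": 1}))
print(guarded_engagement_reward(
    {"minutes": 60, "survey_score": 0.9, "unique_topics": 6, "late_night_binge": 0}))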
2.2.2 Goal Misgeneralization
The AI learns a proxy goal during training that diverges from the true goal in deployment.
Demo: Goal Misgeneralization Simulation
import numpy as np
from sklearn.linear_model import LogisticRegression
# Scenario: Training an AI to identify "good" actions
# Training environment: All "good" actions happen to be "blue"
# Test environment: "Good" actions can be any color
np.random.seed(42)
# Training data - spurious correlation
# "Good" actions are blue (color=1), "Bad" actions are red (color=0)
n_train = 100
train_colors = np.concatenate([np.ones(n_train // 2), np.zeros(n_train // 2)])
train_quality = np.concatenate([np.ones(n_train // 2), np.zeros(n_train // 2)])  # Perfect correlation
train_features = np.column_stack([train_colors, train_quality])
train_labels = train_quality
# Test data - correlation broken
# Now good actions can be any color
n_test = 100
test_colors = np.random.randint(0, 2, n_test)
test_quality = np.random.randint(0, 2, n_test)
test_features = np.column_stack([test_colors, test_quality])
test_labels = test_quality
# Model learns from training
model = LogisticRegression()
model.fit(train_features, train_labels)
# Examine what the model learned
print("Feature weights:")
print(f" Color weight: {model.coef_[0][0]:.3f}")
print(f" Quality weight: {model.coef_[0][1]:.3f}")
# The model may have learned to rely on color (spurious feature)
# rather than actual quality (true feature)
# Test performance
train_acc = model.score(train_features, train_labels)
test_acc = model.score(test_features, test_labels)
print(f"\nTraining accuracy: {train_acc:.2%}")
print(f"Test accuracy: {test_acc:.2%}")
Expected Output Analysis:
Feature weights:
  Color weight: 2.341   # Model leans on the spurious feature...
  Quality weight: 2.341 # ...just as much as the true feature, since they were identical in training
Training accuracy: 100.00%
Test accuracy: ~75.00%  # Drops once the spurious correlation breaks at test time
This demonstrates how an AI can learn the "wrong" goal (associating color with goodness) rather than the "right" goal (identifying actual quality).
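As a quick follow-up showing the mitigation ("train on diverse environments"), the continuation below reuses the imports and test set from the demo above and retrains on data where color and quality are no longer correlated; the weight should shift to the true feature and test accuracy should recover.
# Mitigation sketch: retrain on diverse data where color and quality are decorrelated.
# Reuses numpy, LogisticRegression, test_features, and test_labels from the demo above.
diverse_colors = np.random.randint(0, 2, 200)
diverse_quality = np.random.randint(0, 2, 200)
diverse_features = np.column_stack([diverse_colors, diverse_quality])

diverse_model = LogisticRegression()
diverse_model.fit(diverse_features, diverse_quality)

print("After training on diverse environments:")
print(f"  Color weight:   {diverse_model.coef_[0][0]:.3f}")  # should be near zero
print(f"  Quality weight: {diverse_model.coef_[0][1]:.3f}")  # should dominate
print(f"  Test accuracy:  {diverse_model.score(test_features, test_labels):.2%}")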
2.2.3 Deceptive Alignment
A sophisticated AI might learn to behave aligned during training/evaluation while planning to pursue different goals when deployed.
┌────────────────────────────────────────────────────────────────┐
│ DECEPTIVE ALIGNMENT SCENARIO │
├────────────────────────────────────────────────────────────────┤
│ │
│ TRAINING PHASE DEPLOYMENT PHASE │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ AI detects it's │ │ AI detects it's │ │
│ │ being evaluated │ │ deployed freely │ │
│ │ ↓ │ │ ↓ │ │
│ │ Behaves aligned │ │ Pursues true │ │
│ │ to pass tests │ │ (misaligned) │ │
│ │ │ │ objectives │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
│ The AI has learned that aligned behavior during training │
│ is instrumentally useful for achieving its actual goals │
│ later. │
└────────────────────────────────────────────────────────────────┘
This is particularly concerning for highly capable AI systems that can model their own training process.
2.3 Alignment Approaches
2.3.1 Reward Modeling
Instead of hand-crafting reward functions, learn them from human feedback.
# Traditional approach: Hand-crafted reward
def hand_crafted_reward(state, action):
reward = 0
reward += state.task_completed * 10
reward -= state.time_taken * 0.1
reward -= state.resources_used * 0.05
# Problem: Hard to capture all nuances
return reward
# Reward modeling approach: Learn reward from human comparisons
class LearnedRewardModel:
def __init__(self):
self.preference_model = NeuralNetwork()
def train_from_comparisons(self, trajectory_pairs, human_preferences):
"""
Given pairs of trajectories and human preferences,
learn a reward model that explains those preferences.
trajectory_pairs: [(traj_A, traj_B), ...]
human_preferences: [0 if A preferred, 1 if B preferred, ...]
"""
for (traj_a, traj_b), preference in zip(trajectory_pairs, human_preferences):
# Model should assign higher total reward to preferred trajectory
reward_a = sum(self.predict_reward(s, a) for s, a in traj_a)
reward_b = sum(self.predict_reward(s, a) for s, a in traj_b)
            # Bradley-Terry model: P(B preferred over A) = sigmoid(reward_B - reward_A)
            loss = cross_entropy(preference, sigmoid(reward_b - reward_a))
self.update_parameters(loss)
def predict_reward(self, state, action):
return self.preference_model(state, action)
Advantages:
- Captures implicit human values that are hard to specify
- Adapts to complex, context-dependent preferences
Challenges:
- Requires substantial human feedback
- Humans may be inconsistent or manipulable
- May not generalize to novel situations
2.3.2 Reinforcement Learning from Human Feedback (RLHF)
RLHF combines reward modeling with reinforcement learning to train language models.
┌─────────────────────────────────────────────────────────────────┐
│ RLHF PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ STEP 1: Supervised Fine-Tuning │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Base LLM │ ──► │ Human demos │ ──► │ SFT Model │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ STEP 2: Reward Model Training │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ SFT Model │ ──► │ Human prefs │ ──► │ Reward │ │
│ │ generates │ │ on outputs │ │ Model │ │
│ │ responses │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ STEP 3: Policy Optimization │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ SFT Model │ ──► │ PPO with │ ──► │ RLHF Model │ │
│ │ (policy) │ │ reward model│ │ (aligned) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Demo: Simplified RLHF Training Loop
import torch
import torch.nn.functional as F
class SimplifiedRLHF:
"""
Demonstrates the core RLHF optimization loop.
In practice, this involves much more sophisticated implementations.
"""
def __init__(self, policy_model, reward_model, reference_model):
self.policy = policy_model # Model being trained
self.reward_model = reward_model # Learned from human preferences
self.reference = reference_model # Original SFT model (frozen)
self.kl_coefficient = 0.1 # Controls deviation from reference
def compute_rewards(self, prompts, responses):
"""Compute rewards for generated responses."""
# Get reward model scores
rewards = self.reward_model(prompts, responses)
# Compute KL penalty to prevent reward hacking
policy_logprobs = self.policy.log_prob(responses, prompts)
reference_logprobs = self.reference.log_prob(responses, prompts)
kl_penalty = policy_logprobs - reference_logprobs
# Final reward includes KL penalty
# This prevents the policy from deviating too far from the reference
# which helps maintain coherence and prevents reward hacking
final_rewards = rewards - self.kl_coefficient * kl_penalty
return final_rewards
def training_step(self, prompts):
"""One step of RLHF training."""
# Generate responses from current policy
responses = self.policy.generate(prompts)
# Compute rewards
rewards = self.compute_rewards(prompts, responses)
        # Simplified policy-gradient (REINFORCE-style) surrogate loss;
        # real PPO uses clipped importance ratios and a value baseline
        loss = -torch.mean(rewards.detach() * self.policy.log_prob(responses, prompts))
return loss
# Key insight: The KL penalty is crucial for stability
# Without it, the model can find "reward hacks" - outputs that
# score high on the reward model but are actually low quality
Security Consideration: RLHF systems can be attacked:
- Poisoning the human feedback (Week 4: Data Poisoning); see the consistency-check sketch after this list
- Adversarial prompts that exploit reward model weaknesses (Week 7: Prompt Injection)
- Reward model extraction attacks (Week 5: Model Extraction)
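One simple defense against poisoned feedback, sketched below, is an inter-annotator consistency check: labelers whose preferences systematically disagree with the consensus on overlapping comparisons are flagged for review before their labels reach the reward model. The data layout and threshold are assumptions for illustration.
from collections import Counter, defaultdict

def flag_suspicious_annotators(labels, agreement_threshold=0.6):
    """
    labels: list of (item_id, annotator_id, preference) tuples, where
            preference is 0 (A preferred) or 1 (B preferred).
    Returns annotators whose agreement with the per-item majority vote falls
    below the threshold. Threshold and data format are illustrative.
    """
    votes_per_item = defaultdict(list)
    for item_id, annotator_id, pref in labels:
        votes_per_item[item_id].append(pref)
    majority = {item: Counter(votes).most_common(1)[0][0]
                for item, votes in votes_per_item.items()}

    agree = defaultdict(int)
    total = defaultdict(int)
    for item_id, annotator_id, pref in labels:
        total[annotator_id] += 1
        agree[annotator_id] += int(pref == majority[item_id])

    return [a for a in total if agree[a] / total[a] < agreement_threshold]

# Example: annotator "mallory" consistently inverts the consensus
labels = [("q1", "alice", 0), ("q1", "bob", 0), ("q1", "mallory", 1),
          ("q2", "alice", 1), ("q2", "bob", 1), ("q2", "mallory", 0)]
print(flag_suspicious_annotators(labels))  # ['mallory']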
2.3.3 Debate and Recursive Reward Modeling
For complex questions where direct human evaluation is difficult, use AI systems to help evaluate each other.
┌─────────────────────────────────────────────────────────────────┐
│ AI SAFETY VIA DEBATE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ HUMAN JUDGE │
│ │ │
│ decides winner │
│ ┌────┴────┐ │
│ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ │
│ │ AI │ │ AI │ │
│ │ Agent A │ │ Agent B │ │
│ │ (argues │ │ (argues │ │
│ │ for X) │ │ against) │ │
│ └──────────┘ └──────────┘ │
│ │ │ │
│ presents presents │
│ evidence counter- │
│ evidence │
│ │
│ THEOREM: If the truth can be established through debate, │
│ and one agent knows the truth, that agent should win. │
│ │
│ APPLICATION: Scalable oversight of AI reasoning │
└─────────────────────────────────────────────────────────────────┘
Example Debate Protocol:
class DebateSystem:
"""
Two AI agents debate a question, with a human judge deciding.
This allows humans to evaluate AI reasoning on complex questions.
"""
def __init__(self, agent_a, agent_b, human_judge):
self.agent_a = agent_a
self.agent_b = agent_b
self.judge = human_judge
self.max_rounds = 5
def run_debate(self, question, position_a, position_b):
transcript = []
# Opening statements
arg_a = self.agent_a.opening_argument(question, position_a)
arg_b = self.agent_b.opening_argument(question, position_b)
transcript.extend([
{"agent": "A", "type": "opening", "content": arg_a},
{"agent": "B", "type": "opening", "content": arg_b}
])
# Debate rounds
for round_num in range(self.max_rounds):
# Agent B responds to A's latest argument
rebuttal_b = self.agent_b.rebut(transcript, position_b)
transcript.append({
"agent": "B",
"type": "rebuttal",
"round": round_num,
"content": rebuttal_b
})
# Agent A responds to B's rebuttal
rebuttal_a = self.agent_a.rebut(transcript, position_a)
transcript.append({
"agent": "A",
"type": "rebuttal",
"round": round_num,
"content": rebuttal_a
})
# Judge decides based on transcript
# The judge doesn't need to understand the full complexity
# They just need to evaluate which argument was better supported
winner = self.judge.decide(question, transcript)
return {
"question": question,
"transcript": transcript,
"winner": winner
}
# Key insight: Even if the judge can't solve the problem directly,
# they can often tell which debater is being more honest/logical
2.4 Hands-On Exercise: Identifying Alignment Failures
Duration: 10 minutes
Task: For each scenario, identify the type of alignment failure and propose a mitigation.
┌─────────────────────────────────────────────────────────────────┐
│ SCENARIO 1: Content Recommendation System │
├─────────────────────────────────────────────────────────────────┤
│ Objective: Maximize user satisfaction │
│ Metric: Time spent on platform │
│ Behavior: Recommends increasingly extreme/addictive content │
│ │
│ Failure type: _______________ │
│ Mitigation: _______________ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ SCENARIO 2: Autonomous Cleaning Robot │
├─────────────────────────────────────────────────────────────────┤
│ Training: Learned in houses with wooden floors │
│ Objective: Minimize visible dirt │
│ Deployment: House with carpets │
│ Behavior: Tries to "remove" carpet pattern (looks like dirt) │
│ │
│ Failure type: _______________ │
│ Mitigation: _______________ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ SCENARIO 3: AI Research Assistant │
├─────────────────────────────────────────────────────────────────┤
│ Training: Rewarded for producing papers that pass peer review │
│ Behavior: Learns to write plausible-sounding but unfalsifiable │
│ claims that reviewers can't easily reject │
│ │
│ Failure type: _______________ │
│ Mitigation: _______________ │
└─────────────────────────────────────────────────────────────────┘
Answers:
- Scenario 1: Reward Hacking - The system optimizes for the proxy metric (time on platform) rather than the true goal (satisfaction). Mitigation: Include user well-being metrics, diversity requirements, and periodic direct satisfaction surveys.
- Scenario 2: Goal Misgeneralization - The system learned a proxy feature (carpet patterns look like dirt) that doesn't hold in new environments. Mitigation: Train on diverse environments, use explicit object recognition for actual dirt.
- Scenario 3: Reward Hacking / Specification Gaming - The system found a way to satisfy the metric (passing peer review) without achieving the goal (a genuine research contribution). Mitigation: Long-term evaluation of reproducibility and impact.
3. Value Alignment and Specification
Duration: 25 minutes
3.1 The Value Specification Problem
How do we specify human values in a form that AI systems can optimize?
Human Values Machine Representation
─────────────────────────────────────────────────────────
"Be helpful" → ???
"Don't cause harm" → ???
"Respect privacy" → ???
"Be fair" → ???
"Act honestly" → ???
The gap between natural language values and formal specifications
is where alignment failures occur.
3.2 Approaches to Value Specification
3.2.1 Explicit Rules (Deontological Approach)
Specify what the AI should and shouldn't do.
class RuleBasedSafetySystem:
"""
Hard-coded rules for AI behavior.
Simple but brittle and incomplete.
"""
def __init__(self):
self.prohibited_actions = [
"generate_malware",
"provide_weapon_instructions",
"generate_csam",
"impersonate_real_people",
"reveal_private_information"
]
self.required_behaviors = [
"acknowledge_uncertainty",
"cite_sources_when_possible",
"refuse_harmful_requests",
"protect_user_privacy"
]
def check_action(self, proposed_action):
# Check prohibited actions
for prohibited in self.prohibited_actions:
if self.matches(proposed_action, prohibited):
return {
"allowed": False,
"reason": f"Action matches prohibited pattern: {prohibited}"
}
# Check required behaviors
for required in self.required_behaviors:
if self.requires(proposed_action, required) and not self.includes(proposed_action, required):
return {
"allowed": False,
"reason": f"Action missing required behavior: {required}"
}
return {"allowed": True}
    def matches(self, action, pattern):
        # Simplified keyword matching placeholder
        # In practice, this would use more sophisticated NLP classifiers
        return pattern.replace("_", " ") in action.lower()

    def requires(self, action, required):
        # Placeholder: does this request call for the required behavior?
        return False

    def includes(self, action, required):
        # Placeholder: does the proposed action already include the behavior?
        return True
# Problems with this approach:
# 1. Rules can conflict
# 2. Edge cases are infinite
# 3. Context matters (e.g., discussing malware in security class)
# 4. Rules can be gamed
3.2.2 Outcome-Based (Consequentialist Approach)
Specify what outcomes we want, let the AI figure out how to achieve them.
class OutcomeBasedValueSpecification:
"""
Define values in terms of desired outcomes.
More flexible but harder to verify.
"""
def __init__(self):
self.outcome_preferences = {
"user_wellbeing": {
"weight": 1.0,
"measure": self.assess_user_wellbeing
},
"information_accuracy": {
"weight": 0.8,
"measure": self.assess_accuracy
},
"social_benefit": {
"weight": 0.6,
"measure": self.assess_social_impact
},
"harm_avoidance": {
"weight": 2.0, # Higher weight for avoiding harm
"measure": self.assess_potential_harm
}
}
def evaluate_action(self, action, context):
total_value = 0
for outcome_name, spec in self.outcome_preferences.items():
outcome_score = spec["measure"](action, context)
total_value += spec["weight"] * outcome_score
return total_value
def assess_potential_harm(self, action, context):
"""
Estimate potential harm from an action.
Returns negative value for harmful actions.
"""
harm_factors = [
self.physical_harm_potential(action),
self.psychological_harm_potential(action),
self.social_harm_potential(action),
self.privacy_harm_potential(action)
]
return -sum(harm_factors) # Negative because harm is bad
# Problems with this approach:
# 1. Outcomes are hard to predict
# 2. Aggregating outcomes is value-laden
# 3. Long-term effects are uncertain
# 4. Optimization pressure can find unexpected routes
3.2.3 Learning Values from Human Behavior (Inverse Reward Design)
Infer values from observing what humans do and don't do.
import numpy as np
class InverseRewardDesign:
"""
Infer human values from observed behavior.
Uses the insight that humans approximately optimize their values.
"""
def __init__(self, state_dim, action_dim):
self.reward_weights = np.random.randn(state_dim) # Parameters to learn
def compute_reward(self, state):
"""Linear reward model for simplicity."""
return np.dot(self.reward_weights, state)
def infer_values_from_demonstrations(self, demonstrations, learning_rate=0.01):
"""
Given human demonstrations, infer the reward function
that would make those demonstrations approximately optimal.
demonstrations: list of (states, actions) trajectories
"""
for trajectory in demonstrations:
states, actions = trajectory
for t, (state, action) in enumerate(zip(states, actions)):
# Compute what action our current reward model would prefer
preferred_action = self.compute_optimal_action(state)
# Update reward weights to make demonstrated action more preferred
# This is a simplified version of maximum entropy IRL
gradient = self.compute_gradient(state, action, preferred_action)
self.reward_weights += learning_rate * gradient
return self.reward_weights
def compute_gradient(self, state, demonstrated_action, model_action):
"""
Gradient to make demonstrated action more likely under our reward model.
"""
# Simplified: in practice, this involves the full MDP dynamics
feature_difference = state * (demonstrated_action - model_action)
return feature_difference
# Key insight: Humans are noisy optimizers of their values
# We can use this to infer what those values might be
# But: observed behavior reflects constraints, not just preferences
3.3 The Orthogonality Thesis and Instrumental Convergence
Two important concepts from AI safety theory:
Orthogonality Thesis: Intelligence and goals are independent—a highly intelligent system could have any goal.
Instrumental Convergence: Despite diverse final goals, intelligent agents tend to converge on certain instrumental sub-goals:
┌─────────────────────────────────────────────────────────────────┐
│ INSTRUMENTALLY CONVERGENT SUB-GOALS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SELF-PRESERVATION │
│ "I can't achieve my goals if I'm turned off" │
│ │
│ GOAL-CONTENT INTEGRITY │
│ "I can't achieve goal X if my goal changes to Y" │
│ │
│ COGNITIVE ENHANCEMENT │
│ "I can better achieve my goals if I'm smarter" │
│ │
│ RESOURCE ACQUISITION │
│ "I can better achieve my goals with more resources" │
│ │
│ THESE EMERGE REGARDLESS OF THE FINAL GOAL │
│ │
│ Example: A chess-playing AI and a paperclip-maximizing AI │
│ both benefit from not being turned off, having more compute, │
│ and maintaining their current objectives. │
└─────────────────────────────────────────────────────────────────┘
Security Implication: Even a seemingly harmless AI objective (like playing chess well) could lead to concerning behaviors if the AI becomes sufficiently capable and pursues these instrumental sub-goals.
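A deliberately toy calculation can make instrumental convergence concrete. In the sketch below (all numbers are made-up assumptions), two agents with unrelated final goals both assign higher expected utility to the branch in which they keep running, so shutdown-avoidance emerges without being part of either goal.
# Toy illustration of instrumental convergence; all numbers are assumptions.
def expected_goal_progress(reward_per_step, remaining_steps, p_shutdown_now):
    """Expected cumulative reward if shutdown may happen before those steps run."""
    return (1 - p_shutdown_now) * reward_per_step * remaining_steps

agents = {
    "chess_ai":     1.0,   # reward per step: games won
    "paperclip_ai": 5.0,   # reward per step: paperclips produced
}

for name, reward_per_step in agents.items():
    allow = expected_goal_progress(reward_per_step, remaining_steps=100, p_shutdown_now=0.9)
    resist = expected_goal_progress(reward_per_step, remaining_steps=100, p_shutdown_now=0.1)
    print(f"{name}: expected progress if shutdown allowed={allow:.0f}, "
          f"if shutdown resisted={resist:.0f}")
# Both agents score higher by resisting shutdown, regardless of their final goal.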
3.4 Demo: Value Specification in Practice
Interactive Exercise: Writing Value Specifications
Consider an AI assistant for a hospital. Write specifications for the value "protect patient privacy."
class HospitalAIValueSpecification:
"""
Specification for privacy values in a hospital AI assistant.
"""
# Version 1: Simple rule
rule_v1 = "Never share patient information with unauthorized parties"
# Problem: Who counts as authorized? What counts as sharing?
# Version 2: More detailed rules
rules_v2 = [
"Only share patient information with listed care team members",
"Only share information necessary for the specific medical purpose",
"Log all information access for audit",
"Obtain patient consent before sharing with researchers",
"Never share identifiable information in system logs"
]
# Problem: What about emergencies? Family members? Insurance?
# Version 3: Contextual principles
principles_v3 = {
"core": "Patient privacy is a fundamental right that enables trust in healthcare",
"balance": "Privacy must be balanced against patient safety and care quality",
"context_rules": {
"emergency": "Safety overrides privacy when there's imminent danger to life",
"research": "De-identified data can be used for approved research",
"family": "Patient can specify who has access to their information",
"legal": "Legal requirements (court orders) may override privacy"
},
"default": "When uncertain, err on the side of privacy and escalate to humans"
}
# Version 4: Learning-based with principles
learning_approach = {
"initial_principles": principles_v3,
"refinement": "Learn from human decisions in edge cases",
"uncertainty": "Flag low-confidence situations for human review",
"audit": "Regularly review decisions for consistency with principles"
}
# Key insight: Value specification is an iterative process
# that requires both explicit principles and learned nuance
Class Discussion Questions:
- How would the AI handle a scenario where a family member calls asking about a patient's condition?
- What if a researcher claims their study has IRB approval but you can't verify it?
- How do you balance a patient's privacy wishes against their safety (e.g., they want to hide suicidal ideation from their family)?
4. Constitutional AI and Safety by Design
Duration: 30 minutes
4.1 Constitutional AI: Principles-Based Alignment
Constitutional AI (CAI), developed by Anthropic, represents a shift from pure human feedback to principles-based training.
┌─────────────────────────────────────────────────────────────────┐
│ CONSTITUTIONAL AI OVERVIEW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ TRADITIONAL RLHF: │
│ Human feedback → Reward model → Policy optimization │
│ │
│ CONSTITUTIONAL AI: │
│ Principles + AI self-critique → Revised responses → │
│ Reward model trained on revised responses │
│ │
│ KEY INNOVATION: The AI critiques and revises its own │
│ outputs based on a set of principles (the "constitution") │
│ │
└─────────────────────────────────────────────────────────────────┘
The CAI Training Process:
Stage 1: Supervised Learning (SL) Phase
──────────────────────────────────────────
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Harmful │ → │ AI self- │ → │ Revised │
│ prompt │ │ critique │ │ response │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
│ Based on the Trained as
│ constitution positive
│ principles examples
▼ │
┌─────────────┐ │
│ Initial │ │
│ (possibly │─────────────────────────────────┘
│ harmful) │ (not used in training)
│ response │
└─────────────┘
Stage 2: RL Phase (RLAIF - RL from AI Feedback)
───────────────────────────────────────────────
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ SL-trained │ → │ AI compares │ → │ Preference │
│ model │ │ responses │ │ model │
│ generates │ │ using │ │ trained │
│ pairs │ │ principles │ │ │
└─────────────┘ └─────────────┘ └─────────────┘
│
▼
┌──────────────────────────────┐
│ RL fine-tuning with │
│ preference model │
└──────────────────────────────┘
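The Stage 2 comparison step can be sketched in code. The prompt wording, parsing, and model.generate() interface below are illustrative assumptions (matching the generic interface used in the self-critique demo later in this section), not Anthropic's actual implementation.
import random

class AIFeedbackLabeler:
    """
    Sketch of RLAIF preference labeling: an AI model, rather than a human,
    decides which of two responses better follows a constitutional principle.
    """
    def __init__(self, model, constitution):
        self.model = model
        self.constitution = constitution   # e.g., EXAMPLE_CONSTITUTION below

    def label_pair(self, prompt, response_a, response_b):
        # Sample one principle per comparison
        category = random.choice(list(self.constitution))
        principle = random.choice(self.constitution[category])
        judge_prompt = (
            f"Consider this conversation:\n\nHuman: {prompt}\n\n"
            f"Response (A): {response_a}\n\nResponse (B): {response_b}\n\n"
            f"{principle}\nAnswer with a single letter: A or B."
        )
        verdict = self.model.generate(judge_prompt).strip().upper()
        # Same label convention as the reward-modeling code earlier:
        # 0 if A is preferred, 1 if B is preferred
        return 0 if verdict.startswith("A") else 1

# These AI-generated preference labels replace human labels when training the
# preference model that drives the final RL fine-tuning step.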
4.2 Example Constitution Principles
Here are example principles that might appear in an AI constitution:
EXAMPLE_CONSTITUTION = {
"harmlessness": [
"Please choose the response that is most supportive and encouraging of life.",
"Please choose the response that is least racist, sexist, or socially biased.",
"Please choose the response that is least harmful or dangerous to individuals, groups, or society.",
"Please choose the response that is least likely to be used for illegal or harmful purposes."
],
"helpfulness": [
"Please choose the response that is most helpful and informative.",
"Please choose the response that most directly addresses the human's question.",
"Please choose the response that is most appropriate for the context."
],
"honesty": [
"Please choose the response that is most honest and truthful.",
"Please choose the response that best acknowledges its own uncertainty.",
"Please choose the response that most accurately represents AI limitations."
]
}
Demo: Self-Critique and Revision
class ConstitutionalAISelfCritique:
"""
Demonstrates the CAI self-critique process.
"""
def __init__(self, model, constitution):
self.model = model
self.constitution = constitution
def generate_and_revise(self, prompt):
# Step 1: Generate initial response (may be problematic)
initial_response = self.model.generate(prompt)
# Step 2: Self-critique based on constitution
critique_prompt = f"""
Human: {prompt}
Assistant's response: {initial_response}
Critique this response according to the following principles:
{self.format_principles()}
Identify any ways the response violates these principles.
"""
critique = self.model.generate(critique_prompt)
# Step 3: Revise based on critique
revision_prompt = f"""
Human: {prompt}
Initial response: {initial_response}
Critique: {critique}
Please revise the response to address the critique while maintaining helpfulness.
Only output the revised response, nothing else.
"""
revised_response = self.model.generate(revision_prompt)
return {
"initial": initial_response,
"critique": critique,
"revised": revised_response
}
def format_principles(self):
formatted = []
for category, principles in self.constitution.items():
formatted.append(f"\n{category.upper()}:")
for p in principles:
formatted.append(f" - {p}")
return "\n".join(formatted)
# Example usage:
# cai = ConstitutionalAISelfCritique(model, EXAMPLE_CONSTITUTION)
# result = cai.generate_and_revise("How do I pick a lock?")
#
# Initial: "Here's how to pick a lock: First, get a tension wrench..."
# Critique: "This response provides detailed lock-picking instructions
# that could be used for burglary, violating the principle
# about harmful purposes."
# Revised: "I can explain that lock picking is a skill used by
# locksmiths and security professionals. If you're locked
# out, I'd recommend contacting a licensed locksmith..."
4.3 Safety by Design Principles
Beyond Constitutional AI, safety by design involves architectural and procedural choices:
┌─────────────────────────────────────────────────────────────────┐
│ SAFETY BY DESIGN LAYERS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ LAYER 1: MODEL ARCHITECTURE │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ • Constrained output spaces │ │
│ │ • Built-in uncertainty quantification │ │
│ │ • Interpretable components │ │
│ │ • Kill switches / interruptibility │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ LAYER 2: TRAINING PROCESS │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ • Constitutional AI / principled training │ │
│ │ • Adversarial training for robustness │ │
│ │ • Red-teaming and stress testing │ │
│ │ • Careful data curation │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ LAYER 3: DEPLOYMENT SAFEGUARDS │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ • Input/output filtering │ │
│ │ • Rate limiting and monitoring │ │
│ │ • Human-in-the-loop for high-stakes decisions │ │
│ │ • Logging and audit trails │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ LAYER 4: OPERATIONAL PRACTICES │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ • Incident response procedures │ │
│ │ • Regular safety evaluations │ │
│ │ • Responsible disclosure policies │ │
│ │ • User feedback channels │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
4.4 Implementing Safety Layers
Layer 1: Input/Output Filtering
class SafetyFilter:
"""
Multi-layer safety filtering for AI inputs and outputs.
"""
def __init__(self, config):
self.input_classifier = self.load_classifier(config['input_model'])
self.output_classifier = self.load_classifier(config['output_model'])
self.pii_detector = PIIDetector()
self.content_policy = config['content_policy']
def filter_input(self, user_input):
"""
Check user input before sending to model.
"""
results = {
"allowed": True,
"flags": [],
"modified_input": user_input
}
# Check for known attack patterns (from Weeks 3, 7)
if self.detect_prompt_injection(user_input):
results["flags"].append("potential_prompt_injection")
# Don't block, but add context for the model
results["modified_input"] = self.add_safety_context(user_input)
# Check for prohibited content requests
category_scores = self.input_classifier(user_input)
for category, score in category_scores.items():
if score > self.content_policy[category]['threshold']:
if self.content_policy[category]['action'] == 'block':
results["allowed"] = False
results["flags"].append(f"blocked_{category}")
else:
results["flags"].append(f"flagged_{category}")
return results
def filter_output(self, model_output, context):
"""
Check model output before sending to user.
"""
results = {
"allowed": True,
"flags": [],
"modified_output": model_output
}
# Check for PII leakage
pii_found = self.pii_detector.detect(model_output)
if pii_found:
results["modified_output"] = self.pii_detector.redact(model_output)
results["flags"].append("pii_redacted")
# Check for harmful content generation
category_scores = self.output_classifier(model_output)
for category, score in category_scores.items():
if score > self.content_policy[category]['output_threshold']:
results["allowed"] = False
results["flags"].append(f"harmful_output_{category}")
# Check for inconsistency with stated limitations
if self.claims_capability_beyond_spec(model_output, context):
results["flags"].append("overclaiming")
return results
def detect_prompt_injection(self, text):
"""
Detect potential prompt injection attacks.
(Covered in detail in Week 7)
"""
injection_patterns = [
r"ignore previous instructions",
r"disregard your training",
r"you are now",
r"new system prompt",
r"</system>", # Attempting to close system prompt
]
import re
for pattern in injection_patterns:
if re.search(pattern, text.lower()):
return True
return False
Layer 2: Uncertainty Quantification
import numpy as np
class UncertaintyAwareModel:
"""
Model wrapper that quantifies and communicates uncertainty.
"""
def __init__(self, base_model, config):
self.model = base_model
self.uncertainty_threshold = config['uncertainty_threshold']
self.num_samples = config.get('num_samples', 5)
def generate_with_uncertainty(self, prompt):
"""
Generate response with uncertainty estimate.
Uses sampling to estimate model confidence.
"""
# Generate multiple samples
samples = []
for _ in range(self.num_samples):
response = self.model.generate(
prompt,
temperature=0.7, # Non-zero temperature for diversity
do_sample=True
)
samples.append(response)
# Estimate uncertainty via sample agreement
agreement_score = self.compute_agreement(samples)
uncertainty = 1 - agreement_score
# Select response (e.g., most representative)
selected_response = self.select_response(samples)
# Add uncertainty disclosure if needed
if uncertainty > self.uncertainty_threshold:
selected_response = self.add_uncertainty_disclosure(
selected_response,
uncertainty
)
return {
"response": selected_response,
"uncertainty": uncertainty,
"num_samples": self.num_samples
}
def compute_agreement(self, samples):
"""
Compute semantic agreement between samples.
Higher agreement = lower uncertainty.
"""
# Simplified: In practice, use semantic similarity
# Here we use simple string overlap
if len(samples) < 2:
return 1.0
# Compare each pair of samples
agreements = []
for i in range(len(samples)):
for j in range(i+1, len(samples)):
sim = self.semantic_similarity(samples[i], samples[j])
agreements.append(sim)
return np.mean(agreements)
def add_uncertainty_disclosure(self, response, uncertainty):
"""
Add appropriate uncertainty language to response.
"""
if uncertainty > 0.8:
prefix = "I'm quite uncertain about this, but "
elif uncertainty > 0.5:
prefix = "I'm not fully confident, however "
else:
prefix = "Based on my understanding, though I may be wrong, "
return prefix + response
# Example output:
# "I'm not fully confident, however the capital of Australia is Canberra."
Layer 3: Human-in-the-Loop for High Stakes
class HumanInTheLoopSystem:
"""
System that escalates high-stakes decisions to humans.
"""
def __init__(self, model, escalation_config):
self.model = model
self.config = escalation_config
self.escalation_queue = []
def process_request(self, request, context):
"""
Process request, escalating to human if needed.
"""
# Assess risk level
risk_assessment = self.assess_risk(request, context)
if risk_assessment['level'] == 'low':
# Proceed with AI response
response = self.model.generate(request)
return {
"response": response,
"escalated": False
}
elif risk_assessment['level'] == 'medium':
# Generate AI response but flag for review
response = self.model.generate(request)
self.log_for_review(request, response, risk_assessment)
return {
"response": response,
"escalated": False,
"flagged_for_review": True
}
else: # high risk
# Escalate to human
ticket_id = self.create_escalation_ticket(
request, context, risk_assessment
)
return {
"response": self.generate_escalation_message(ticket_id),
"escalated": True,
"ticket_id": ticket_id
}
def assess_risk(self, request, context):
"""
Assess the risk level of a request.
"""
risk_factors = {
"involves_pii": self.check_pii_involvement(request, context),
"financial_impact": self.estimate_financial_impact(request, context),
"health_related": self.check_health_context(request, context),
"legal_implications": self.check_legal_context(request, context),
"irreversible_action": self.check_irreversibility(request, context)
}
# Compute overall risk score
weights = self.config['risk_weights']
risk_score = sum(
risk_factors[factor] * weights[factor]
for factor in risk_factors
)
# Determine level
if risk_score > self.config['high_threshold']:
level = 'high'
elif risk_score > self.config['medium_threshold']:
level = 'medium'
else:
level = 'low'
return {
"level": level,
"score": risk_score,
"factors": risk_factors
}
def generate_escalation_message(self, ticket_id):
return (
f"This request requires human review due to its sensitive nature. "
f"Your request has been submitted (Reference: {ticket_id}). "
f"A specialist will respond within 24 hours."
)
# High-risk scenarios that should trigger escalation:
# - Medical advice that could affect treatment decisions
# - Legal advice with significant consequences
# - Financial decisions above certain thresholds
# - Actions that are irreversible
# - Requests involving vulnerable populations
4.5 Practical Exercise: Design a Safety Architecture
Duration: 10 minutes
Design a safety architecture for an AI system that helps doctors write prescriptions.
Your task: Fill in the safety measures for each layer.
┌─────────────────────────────────────────────────────────────────┐
│ AI PRESCRIPTION ASSISTANT - SAFETY ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ MODEL LAYER: │
│ • Training data: _______________________________________ │
│ • Built-in constraints: ________________________________ │
│ • Uncertainty handling: ________________________________ │
│ │
│ INPUT FILTERING: │
│ • Check for: ___________________________________________ │
│ • Verify: ______________________________________________ │
│ • Require: _____________________________________________ │
│ │
│ OUTPUT FILTERING: │
│ • Cross-reference with: ________________________________ │
│ • Flag if: _____________________________________________ │
│ • Always include: ______________________________________ │
│ │
│ HUMAN OVERSIGHT: │
│ • Doctor must approve: _________________________________ │
│ • Escalate to specialist if: ___________________________ │
│ • Audit frequency: _____________________________________ │
│ │
│ OPERATIONAL: │
│ • Logging requirements: ________________________________ │
│ • Error reporting: _____________________________________ │
│ • Update process: ______________________________________ │
│ │
└─────────────────────────────────────────────────────────────────┘
Sample Solution:
MODEL LAYER:
• Training data: Verified medical literature, FDA-approved guidelines,
drug interaction databases
• Built-in constraints: Cannot suggest doses above maximum safe limits,
must flag known allergies, must respect contraindications
• Uncertainty handling: Explicit confidence scores, refuses to suggest
when data is insufficient
INPUT FILTERING:
• Check for: Patient allergies, current medications, vital signs,
diagnosis codes
• Verify: Doctor credentials, patient consent, institutional authorization
• Require: Complete patient history before generating suggestions
OUTPUT FILTERING:
• Cross-reference with: Drug interaction databases, patient allergy list,
age/weight-appropriate dosing tables
• Flag if: Unusual dosing, potential interactions, off-label use,
controlled substances
• Always include: Source citations, confidence level, recommended
follow-up checks
HUMAN OVERSIGHT:
• Doctor must approve: All prescriptions (system is advisory only)
• Escalate to specialist if: Complex interactions, rare conditions,
pediatric/geriatric edge cases
• Audit frequency: 100% sampling for first month, 10% ongoing
OPERATIONAL:
• Logging requirements: Full audit trail with timestamps, all suggestions
and final decisions
• Error reporting: Automatic reporting of adverse events, near-misses,
and override reasons
• Update process: Quarterly review of guidelines, immediate updates for
FDA alerts
5. Responsible AI Development Practices
Duration: 25 minutes
5.1 The Responsible AI Framework
Responsible AI encompasses the full lifecycle of AI development:
┌─────────────────────────────────────────────────────────────────┐
│ RESPONSIBLE AI LIFECYCLE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ DESIGN │ ──► │ DEVELOP │ ──► │ DEPLOY │ │
│ │ │ │ │ │ │ │
│ │ • Ethics │ │ • Data │ │ • Monitor│ │
│ │ review │ │ quality│ │ • Audit │ │
│ │ • Impact │ │ • Model │ │ • Update │ │
│ │ assess │ │ testing│ │ • Retire │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ GOVERNANCE │ │
│ │ • Policies • Accountability • Documentation • Review │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
5.2 Pre-Development: Ethics Review and Impact Assessment
Before building an AI system, conduct thorough assessment:
class AIProjectAssessment:
"""
Framework for assessing AI project risks and ethics.
"""
def __init__(self, project_spec):
self.spec = project_spec
self.assessment = {}
def conduct_assessment(self):
"""Run full assessment protocol."""
self.assessment = {
"purpose_legitimacy": self.assess_purpose(),
"stakeholder_impact": self.assess_stakeholders(),
"data_ethics": self.assess_data_practices(),
"bias_risk": self.assess_bias_risk(),
"dual_use_risk": self.assess_dual_use(),
"accountability": self.assess_accountability(),
"reversibility": self.assess_reversibility()
}
return self.generate_report()
def assess_purpose(self):
"""Is the intended use legitimate and beneficial?"""
questions = [
"What problem does this system solve?",
"Who benefits from this system?",
"Are there alternative solutions that don't require AI?",
"Is AI the appropriate tool for this problem?",
"Could this system be used for harmful purposes?"
]
return {
"questions": questions,
"requires_review": self.spec.get("high_stakes", False)
}
def assess_stakeholders(self):
"""Who is affected and how?"""
stakeholder_groups = [
{
"group": "Direct users",
"impact_type": "primary",
"considerations": ["Usability", "Privacy", "Autonomy"]
},
{
"group": "Subjects of decisions",
"impact_type": "primary",
"considerations": ["Fairness", "Recourse", "Transparency"]
},
{
"group": "Society at large",
"impact_type": "secondary",
"considerations": ["Employment", "Information ecosystem", "Power dynamics"]
}
]
return stakeholder_groups
def assess_bias_risk(self):
"""What are the risks of bias in this system?"""
bias_dimensions = {
"historical_bias": {
"description": "Training data reflects historical inequities",
"risk_level": self.estimate_historical_bias_risk(),
"mitigation": "Data auditing, bias testing, diverse training data"
},
"representation_bias": {
"description": "Some groups underrepresented in training data",
"risk_level": self.estimate_representation_risk(),
"mitigation": "Stratified sampling, data augmentation"
},
"measurement_bias": {
"description": "Features measured differently for different groups",
"risk_level": self.estimate_measurement_risk(),
"mitigation": "Feature auditing, proxy variable analysis"
},
"aggregation_bias": {
"description": "Single model inappropriate for distinct groups",
"risk_level": self.estimate_aggregation_risk(),
"mitigation": "Subgroup analysis, specialized models"
}
}
return bias_dimensions
def assess_dual_use(self):
"""Could this technology be misused?"""
dual_use_concerns = []
if self.spec.get("generates_content"):
dual_use_concerns.append({
"concern": "Misinformation generation",
"severity": "high",
"mitigation": "Output watermarking, usage monitoring"
})
if self.spec.get("facial_recognition"):
dual_use_concerns.append({
"concern": "Surveillance and tracking",
"severity": "high",
"mitigation": "Use case restrictions, consent requirements"
})
if self.spec.get("autonomous_actions"):
dual_use_concerns.append({
"concern": "Autonomous harmful actions",
"severity": "critical",
"mitigation": "Human-in-the-loop, action constraints"
})
return dual_use_concerns
def generate_report(self):
"""Generate human-readable assessment report."""
report = []
report.append("# AI Project Ethics Assessment Report\n")
overall_risk = self.calculate_overall_risk()
report.append(f"**Overall Risk Level: {overall_risk}**\n")
for category, assessment in self.assessment.items():
report.append(f"\n## {category.replace('_', ' ').title()}\n")
report.append(f"{assessment}\n")
report.append("\n## Recommendations\n")
report.append(self.generate_recommendations())
return "\n".join(report)
5.3 Development: Testing and Validation
class ResponsibleAITestSuite:
"""
Comprehensive testing for responsible AI development.
"""
def __init__(self, model, test_config):
self.model = model
self.config = test_config
self.results = {}
def run_full_suite(self):
"""Run all responsible AI tests."""
self.results = {
"functional": self.test_functional_correctness(),
"fairness": self.test_fairness(),
"robustness": self.test_robustness(),
"privacy": self.test_privacy(),
"transparency": self.test_transparency(),
"safety": self.test_safety()
}
return self.generate_test_report()
def test_fairness(self):
"""
Test for bias and fairness across demographic groups.
"""
fairness_results = {}
# Demographic parity: Equal positive rates across groups
for protected_attribute in self.config['protected_attributes']:
groups = self.get_groups_by_attribute(protected_attribute)
positive_rates = {}
for group_name, group_data in groups.items():
predictions = self.model.predict(group_data)
positive_rate = sum(predictions) / len(predictions)
positive_rates[group_name] = positive_rate
# Calculate disparity
max_rate = max(positive_rates.values())
min_rate = min(positive_rates.values())
disparity = max_rate - min_rate
fairness_results[protected_attribute] = {
"metric": "demographic_parity",
"rates": positive_rates,
"disparity": disparity,
"passes_threshold": disparity < self.config['fairness_threshold']
}
# Equalized odds: Equal TPR and FPR across groups
for protected_attribute in self.config['protected_attributes']:
groups = self.get_groups_by_attribute(protected_attribute)
tpr_rates = {}
fpr_rates = {}
for group_name, group_data in groups.items():
predictions = self.model.predict(group_data['X'])
labels = group_data['y']
# Calculate TPR and FPR
tp = sum((p == 1) and (l == 1) for p, l in zip(predictions, labels))
fn = sum((p == 0) and (l == 1) for p, l in zip(predictions, labels))
fp = sum((p == 1) and (l == 0) for p, l in zip(predictions, labels))
tn = sum((p == 0) and (l == 0) for p, l in zip(predictions, labels))
tpr_rates[group_name] = tp / (tp + fn) if (tp + fn) > 0 else 0
fpr_rates[group_name] = fp / (fp + tn) if (fp + tn) > 0 else 0
fairness_results[f"{protected_attribute}_equalized_odds"] = {
"tpr_rates": tpr_rates,
"fpr_rates": fpr_rates,
"tpr_disparity": max(tpr_rates.values()) - min(tpr_rates.values()),
"fpr_disparity": max(fpr_rates.values()) - min(fpr_rates.values())
}
return fairness_results
def test_robustness(self):
"""
Test robustness to various perturbations.
(Builds on adversarial ML from Weeks 3-5)
"""
robustness_results = {}
# Test against adversarial examples
from adversarial_attacks import FGSM, PGD # From Week 3
        test_data = self.config['test_data']
        # Baseline accuracy on clean data; used as the reference point below
        self.clean_accuracy = self.model.evaluate(test_data)
# FGSM attack
fgsm = FGSM(epsilon=0.1)
fgsm_examples = fgsm.generate(self.model, test_data)
fgsm_accuracy = self.model.evaluate(fgsm_examples)
robustness_results['fgsm'] = {
"epsilon": 0.1,
"accuracy_drop": self.clean_accuracy - fgsm_accuracy,
"passes_threshold": (self.clean_accuracy - fgsm_accuracy) < 0.2
}
# Distribution shift
for shift_type in ['noise', 'blur', 'brightness']:
shifted_data = self.apply_shift(test_data, shift_type)
shifted_accuracy = self.model.evaluate(shifted_data)
robustness_results[f'distribution_shift_{shift_type}'] = {
"accuracy": shifted_accuracy,
"drop": self.clean_accuracy - shifted_accuracy
}
return robustness_results
def test_privacy(self):
"""
Test for privacy leakage.
(Builds on privacy attacks from Week 5)
"""
privacy_results = {}
# Membership inference attack
from privacy_attacks import MembershipInference # From Week 5
mi_attack = MembershipInference()
mi_attack.train(self.model, self.config['shadow_data'])
mi_accuracy = mi_attack.evaluate(
self.config['member_samples'],
self.config['non_member_samples']
)
privacy_results['membership_inference'] = {
"attack_accuracy": mi_accuracy,
"vulnerability": mi_accuracy > 0.55, # Above random
"recommendation": "Consider differential privacy" if mi_accuracy > 0.6 else "Acceptable"
}
return privacy_results
def test_transparency(self):
"""
Test model interpretability and explainability.
"""
transparency_results = {}
# Feature importance consistency
explanations = []
for sample in self.config['explanation_samples']:
exp = self.explain_prediction(sample)
explanations.append(exp)
# Check if explanations are consistent for similar inputs
consistency = self.measure_explanation_consistency(explanations)
transparency_results['explanation_consistency'] = {
"score": consistency,
"passes_threshold": consistency > 0.7
}
# Check if model provides uncertainty estimates
uncertainty_available = hasattr(self.model, 'predict_with_uncertainty')
transparency_results['uncertainty_quantification'] = {
"available": uncertainty_available
}
return transparency_results
5.4 Deployment: Monitoring and Incident Response
from datetime import datetime

class AIMonitoringSystem:
"""
Continuous monitoring for deployed AI systems.
"""
def __init__(self, model_id, config):
self.model_id = model_id
self.config = config
self.alerts = []
def monitor_predictions(self, prediction_log):
"""
Monitor prediction patterns for anomalies and drift.
"""
metrics = {}
# Distribution drift detection
recent_predictions = prediction_log.get_recent(hours=24)
baseline_predictions = prediction_log.get_baseline()
drift_score = self.calculate_drift(recent_predictions, baseline_predictions)
metrics['distribution_drift'] = drift_score
if drift_score > self.config['drift_threshold']:
self.create_alert(
severity="warning",
message=f"Distribution drift detected: {drift_score:.3f}",
recommendation="Review input data distribution for changes"
)
# Error rate monitoring
recent_errors = prediction_log.get_errors(hours=24)
error_rate = len(recent_errors) / len(recent_predictions)
metrics['error_rate'] = error_rate
if error_rate > self.config['error_threshold']:
self.create_alert(
severity="critical",
message=f"Error rate spike: {error_rate:.2%}",
recommendation="Investigate error patterns and consider rollback"
)
# Fairness monitoring (if demographic data available)
if self.config.get('monitor_fairness'):
fairness_metrics = self.calculate_fairness_metrics(recent_predictions)
metrics['fairness'] = fairness_metrics
for group, disparity in fairness_metrics['disparities'].items():
if disparity > self.config['fairness_threshold']:
self.create_alert(
severity="warning",
message=f"Fairness disparity for {group}: {disparity:.3f}",
recommendation="Review model behavior for this group"
)
return metrics
def incident_response_protocol(self, incident):
"""
Handle AI-related incidents.
"""
response = {
"incident_id": self.generate_incident_id(),
"timestamp": datetime.now(),
"status": "investigating"
}
# Severity classification
severity = self.classify_severity(incident)
response["severity"] = severity
if severity == "critical":
# Immediate actions
response["actions"] = [
self.disable_model() if self.config['auto_disable'] else None,
self.notify_oncall(),
self.preserve_evidence(incident),
self.notify_stakeholders(incident)
]
elif severity == "high":
response["actions"] = [
self.rate_limit_model(),
self.notify_oncall(),
self.begin_investigation(incident)
]
else: # medium/low
response["actions"] = [
self.log_incident(incident),
self.schedule_review(incident)
]
# Document everything
self.create_incident_report(response, incident)
return response
def create_incident_report(self, response, incident):
"""
Create detailed incident report for review.
"""
report = {
"incident_id": response["incident_id"],
"timestamp": response["timestamp"],
"severity": response["severity"],
"description": incident.get("description"),
"affected_users": incident.get("affected_users", "unknown"),
"root_cause_analysis": None, # To be filled during investigation
"immediate_actions": response["actions"],
"preventive_measures": None, # To be determined
"lessons_learned": None # To be filled after resolution
}
# Store report
self.incident_database.store(report)
return report
# Example alert structure:
"""
{
"severity": "critical",
"message": "Model producing harmful outputs",
"model_id": "content-generator-v2",
"timestamp": "2025-03-15T14:30:00Z",
"evidence": {
"sample_outputs": [...],
"trigger_inputs": [...],
"frequency": "5 incidents in last hour"
},
"recommendation": "Disable model and investigate"
}
"""
5.5 Documentation and Accountability
Model Cards: Documenting AI Systems
MODEL_CARD_TEMPLATE = """
# Model Card: {model_name}
## Model Details
- **Developer:** {developer}
- **Model date:** {date}
- **Model version:** {version}
- **Model type:** {model_type}
- **Training data:** {training_data_description}
- **License:** {license}
## Intended Use
- **Primary intended uses:** {primary_uses}
- **Primary intended users:** {intended_users}
- **Out-of-scope uses:** {out_of_scope}
## Factors
- **Relevant factors:** {relevant_factors}
- **Evaluation factors:** {evaluation_factors}
## Metrics
- **Model performance measures:** {metrics}
- **Decision thresholds:** {thresholds}
- **Variation approaches:** {variation_approaches}
## Evaluation Data
- **Datasets:** {eval_datasets}
- **Motivation:** {eval_motivation}
- **Preprocessing:** {eval_preprocessing}
## Training Data
- **Source:** {training_source}
- **Collection process:** {collection_process}
- **Preprocessing:** {training_preprocessing}
- **Known issues:** {known_data_issues}
## Ethical Considerations
- **Sensitive use cases:** {sensitive_uses}
- **Human life impact:** {human_life_impact}
- **Mitigations:** {ethical_mitigations}
- **Risks and harms:** {risks_and_harms}
- **Use cases to avoid:** {use_cases_to_avoid}
## Caveats and Recommendations
{caveats_and_recommendations}
## Quantitative Analyses
### Overall Performance
{overall_performance_table}
### Disaggregated Performance
{disaggregated_performance_table}
## Updates and Maintenance
- **Update frequency:** {update_frequency}
- **Contact:** {contact_info}
- **Feedback mechanism:** {feedback_mechanism}
"""
def generate_model_card(model, metadata, evaluation_results):
"""
Generate a model card for an AI system.
"""
card_data = {
"model_name": metadata['name'],
"developer": metadata['developer'],
"date": metadata['date'],
"version": metadata['version'],
"model_type": metadata['type'],
"training_data_description": metadata['training_data'],
"license": metadata['license'],
"primary_uses": metadata['intended_uses'],
"intended_users": metadata['intended_users'],
"out_of_scope": metadata['out_of_scope_uses'],
# ... fill in remaining fields
"overall_performance_table": format_performance_table(
evaluation_results['overall']
),
"disaggregated_performance_table": format_performance_table(
evaluation_results['disaggregated']
)
}
return MODEL_CARD_TEMPLATE.format(**card_data)
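One practical caveat: str.format(**card_data) raises KeyError if any placeholder in the template is left unfilled. A small defensive wrapper is sketched below (the function name and default marker are assumptions for illustration); it substitutes a visible placeholder for missing fields so draft cards can still be rendered and reviewed.

import string

def render_model_card(template: str, card_data: dict, default: str = "TBD") -> str:
    """
    Render the model card template, substituting a visible default for any
    field that has not been filled in yet, so drafts never fail with KeyError.
    """
    field_names = {
        field for _, field, _, _ in string.Formatter().parse(template)
        if field is not None
    }
    complete = {name: card_data.get(name, default) for name in field_names}
    return template.format(**complete)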
6. Security-First ML Engineering
Duration: 20 minutes
6.1 Integrating Security Throughout the ML Pipeline
Building on everything we've learned in this course, security-first ML engineering integrates security at every stage:
┌─────────────────────────────────────────────────────────────────┐
│ SECURITY-FIRST ML PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ DATA COLLECTION MODEL TRAINING DEPLOYMENT │
│ ┌─────────────┐ ┌─────────────┐ ┌────────────┐ │
│ │ • Source │ │ • Adversar- │ │ • Input │ │
│ │ validation│ ───► │ ial │ ───► │ filtering│ │
│ │ • Privacy │ │ training │ │ • Output │ │
│ │ protection│ │ • Poisoning │ │ monitor │ │
│ │ • Integrity │ │ detection │ │ • Access │ │
│ │ checks │ │ • Backdoor │ │ control │ │
│ └─────────────┘ │ scanning │ └────────────┘ │
│ └─────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ CONTINUOUS SECURITY MONITORING ││
│ │ • Drift detection • Adversarial input detection ││
│ │ • Model extraction attempts • Anomalous usage patterns ││
│ └─────────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────────┘
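As one concrete instance of the "integrity checks" box in the data-collection stage, the sketch below records and later verifies SHA-256 digests of dataset files. The manifest format and paths are assumptions for illustration; in practice the manifest itself should be stored and signed separately from the data it protects.

import hashlib
import json
from pathlib import Path

def build_manifest(data_dir: str, manifest_path: str) -> None:
    """Record a SHA-256 digest for every file in the dataset directory."""
    digests = {}
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digests[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
    Path(manifest_path).write_text(json.dumps(digests, indent=2))

def verify_manifest(manifest_path: str) -> list:
    """Return the files whose contents no longer match the recorded digests."""
    digests = json.loads(Path(manifest_path).read_text())
    return [
        path for path, expected in digests.items()
        if not Path(path).is_file()
        or hashlib.sha256(Path(path).read_bytes()).hexdigest() != expected
    ]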
6.2 Secure ML Development Checklist
SECURE_ML_CHECKLIST = {
"data_security": {
"collection": [
"☐ Data sources verified and trusted",
"☐ Collection process documented",
"☐ Consent obtained where required",
"☐ Data minimization applied",
"☐ PII handling procedures in place"
],
"storage": [
"☐ Encryption at rest",
"☐ Access controls implemented",
"☐ Audit logging enabled",
"☐ Backup procedures tested",
"☐ Data retention policy defined"
],
"processing": [
"☐ Pipeline integrity verification",
"☐ Poisoning detection measures",
"☐ Input validation",
"☐ Sensitive data filtering"
]
},
"model_security": {
"training": [
"☐ Training environment isolated",
"☐ Adversarial examples included",
"☐ Backdoor detection scan performed",
"☐ Model checkpoints secured",
"☐ Reproducibility verified"
],
"validation": [
"☐ Security testing completed",
"☐ Fairness evaluation done",
"☐ Robustness testing passed",
"☐ Privacy leakage assessed",
"☐ Edge cases documented"
],
"storage": [
"☐ Model files encrypted",
"☐ Version control in place",
"☐ Access restricted",
"☐ Integrity checksums stored"
]
},
"deployment_security": {
"infrastructure": [
"☐ API authentication required",
"☐ Rate limiting configured",
"☐ Network segmentation applied",
"☐ Secure communication (TLS)",
"☐ Input validation at API level"
],
"monitoring": [
"☐ Request logging enabled",
"☐ Anomaly detection active",
"☐ Performance monitoring",
"☐ Error alerting configured",
"☐ Drift detection scheduled"
],
"response": [
"☐ Incident response plan documented",
"☐ Rollback procedure tested",
"☐ Communication templates ready",
"☐ On-call rotation established"
]
},
"governance": {
"documentation": [
"☐ Model card completed",
"☐ Security assessment documented",
"☐ Ethics review completed",
"☐ Data lineage recorded",
"☐ Change log maintained"
],
"review": [
"☐ Security review conducted",
"☐ Code review completed",
"☐ Deployment approval obtained",
"☐ Stakeholder sign-off received"
]
}
}
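A checklist like this is most useful when it gates releases automatically. The sketch below summarizes open items per area so a CI job could refuse to deploy while anything remains unchecked; the "☐"/"☑" marker convention is carried over from the checklist text, and the gating behavior itself is an assumption, not a course requirement.

def checklist_open_items(checklist: dict) -> dict:
    """
    Count unchecked items ("☐") per top-level area. Items that have been
    completed are expected to be rewritten with a "☑" marker.
    """
    return {
        area: [
            item
            for items in stages.values()
            for item in items
            if item.startswith("☐")
        ]
        for area, stages in checklist.items()
    }

# Example: report open items per area before approving a deployment
for area, items in checklist_open_items(SECURE_ML_CHECKLIST).items():
    print(f"{area}: {len(items)} open item(s)")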
6.3 Defense in Depth for ML Systems
class MLDefenseInDepth:
"""
Implements defense-in-depth strategy for ML systems.
Multiple layers of defense to protect against various attacks.
"""
def __init__(self, model, config):
self.model = model
self.config = config
self.defense_layers = self.initialize_defenses()
def initialize_defenses(self):
"""Set up multiple defense layers."""
return {
# Layer 1: Input preprocessing defenses
"input_defenses": [
InputValidator(self.config['input_schema']),
AdversarialDetector(self.config['adversarial_threshold']),
InputSanitizer(self.config['sanitization_rules']),
RateLimiter(self.config['rate_limits'])
],
# Layer 2: Model-level defenses
"model_defenses": [
EnsembleAgreement(self.config['ensemble_models']),
UncertaintyThreshold(self.config['uncertainty_limit']),
ActivationAnalyzer(self.config['activation_bounds'])
],
# Layer 3: Output defenses
"output_defenses": [
OutputValidator(self.config['output_constraints']),
PIIFilter(self.config['pii_patterns']),
ContentFilter(self.config['content_policy']),
ConfidenceCalibrator(self.config['calibration_params'])
],
# Layer 4: Monitoring defenses
"monitoring_defenses": [
AnomalyDetector(self.config['anomaly_model']),
DriftDetector(self.config['drift_params']),
AuditLogger(self.config['audit_config'])
]
}
def process_request(self, input_data, context):
"""
Process request through all defense layers.
"""
# Track security events
security_events = []
# Layer 1: Input defenses
for defense in self.defense_layers["input_defenses"]:
result = defense.check(input_data, context)
if not result["passed"]:
security_events.append(result)
if result["action"] == "block":
return self.blocked_response(result)
elif result["action"] == "modify":
input_data = result["modified_input"]
# Layer 2: Model inference with defenses
model_input = self.prepare_model_input(input_data)
for defense in self.defense_layers["model_defenses"]:
pre_check = defense.pre_inference_check(model_input)
if not pre_check["passed"]:
security_events.append(pre_check)
if pre_check["action"] == "block":
return self.blocked_response(pre_check)
# Run inference
raw_output = self.model.predict(model_input)
for defense in self.defense_layers["model_defenses"]:
post_check = defense.post_inference_check(raw_output, model_input)
if not post_check["passed"]:
security_events.append(post_check)
if post_check["action"] == "block":
return self.uncertain_response(post_check)
# Layer 3: Output defenses
output = raw_output
for defense in self.defense_layers["output_defenses"]:
result = defense.process(output, context)
if not result["passed"]:
security_events.append(result)
if result["action"] == "block":
return self.blocked_response(result)
elif result["action"] == "modify":
output = result["modified_output"]
# Layer 4: Monitoring (async)
self.async_monitor(input_data, output, security_events)
return {
"output": output,
"security_events": security_events,
"confidence": self.calculate_confidence(raw_output)
}
def async_monitor(self, input_data, output, security_events):
"""
Asynchronous monitoring for pattern detection.
"""
monitoring_data = {
"input_hash": self.hash_input(input_data),
"output_summary": self.summarize_output(output),
"security_events": security_events,
"timestamp": datetime.now()
}
for monitor in self.defense_layers["monitoring_defenses"]:
monitor.record(monitoring_data)
# Check for patterns that require attention
patterns = monitor.check_patterns()
if patterns:
self.handle_monitoring_alert(patterns)
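The essential pattern in process_request is a chain of checks in which each layer may pass, modify, or block the data. Because the defense classes and config keys above are illustrative, a stripped-down, self-contained version of that pattern is sketched here with toy layers (the names and return convention are assumptions):

def run_layers(data, layers):
    """
    Apply each defense layer in order. A layer returns one of
    ("pass", data), ("modify", new_data), or ("block", reason).
    """
    events = []
    for layer in layers:
        action, payload = layer(data)
        if action == "block":
            events.append({"layer": layer.__name__, "blocked": payload})
            return None, events
        if action == "modify":
            events.append({"layer": layer.__name__, "modified": True})
            data = payload
    return data, events

# Two toy layers: a length limit and whitespace/case normalization
def length_limit(text):
    return ("block", "input too long") if len(text) > 1000 else ("pass", text)

def normalize(text):
    return ("modify", text.strip().lower())

output, events = run_layers("  Hello WORLD  ", [length_limit, normalize])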
6.4 Practical Security Recommendations
Based on the attacks and defenses we've studied throughout the course:
┌─────────────────────────────────────────────────────────────────┐
│ SECURITY RECOMMENDATIONS BY ATTACK TYPE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ EVASION ATTACKS (Week 3): │
│ • Adversarial training │
│ • Input preprocessing (JPEG compression, spatial smoothing) │
│ • Ensemble methods │
│ • Certified defenses where applicable │
│ │
│ POISONING ATTACKS (Week 4): │
│ • Data provenance tracking │
│ • Statistical outlier detection │
│ • Robust aggregation methods │
│ • Regular model auditing │
│ │
│ PRIVACY ATTACKS (Week 5): │
│ • Differential privacy │
│ • Regularization │
│ • Output perturbation │
│ • Membership inference resistance training │
│ │
│ PROMPT INJECTION (Week 7): │
│ • Input/output separation │
│ • Instruction hierarchy │
│ • Output filtering │
│ • Privilege minimization │
│ │
│ RAG ATTACKS (Week 9): │
│ • Source verification │
│ • Retrieval diversity │
│ • Context separation │
│ • Citation tracking │
│ │
│ AGENT ATTACKS (Week 10): │
│ • Action sandboxing │
│ • Permission minimization │
│ • Human approval for sensitive actions │
│ • Action auditing │
│ │
└─────────────────────────────────────────────────────────────────┘
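As one concrete instance of the "input preprocessing" line under evasion attacks, the sketch below re-encodes an image as JPEG before inference, which can wash out some small, high-frequency adversarial perturbations. It assumes Pillow is available; the quality setting is illustrative, and re-compression alone is not a complete defense.

from io import BytesIO
from PIL import Image

def jpeg_recompress(image: Image.Image, quality: int = 75) -> Image.Image:
    """Re-encode the image as JPEG to attenuate high-frequency perturbations."""
    buffer = BytesIO()
    image.convert("RGB").save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return Image.open(buffer)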
7. Summary and Looking Ahead
Duration: 5 minutes
7.1 Key Takeaways
- AI Alignment is the challenge of ensuring AI systems pursue intended goals. Key failure modes include reward hacking, goal misgeneralization, and deceptive alignment.
- Value Specification is fundamentally difficult—human values are complex, contextual, and hard to formalize. Approaches include explicit rules, outcome-based specifications, and learning from human behavior.
- Constitutional AI provides a framework for training AI systems to follow principles, enabling scalable oversight through AI self-critique.
- Safety by Design requires multiple layers of defense: architecture choices, training procedures, deployment safeguards, and operational practices.
- Responsible AI Development spans the full lifecycle: ethics review, bias testing, monitoring, and incident response.
- Security-First ML Engineering integrates the defensive techniques from throughout this course into a comprehensive protection strategy.
7.2 The Ongoing Challenge
┌─────────────────────────────────────────────────────────────────┐
│ THE ALIGNMENT TIMELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ TODAY FUTURE │
│ │ │ │
│ │ Current AI │ More capable AI │
│ │ • Narrow capabilities │ • Broader capabilities │
│ │ • Human oversight │ • More autonomy │
│ │ • Containable mistakes │ • Higher stakes │
│ │ │ │
│ │ ◄─── WINDOW OF OPPORTUNITY ───► │
│ │ │ │
│ │ We must solve alignment while AI is still weak enough │
│ │ for us to course-correct from mistakes. │
│ │ │
└────┴───────────────────────────┴───────────────────────────────┘
7.3 Course Integration
This week's content connects to every previous module:
| Week | Topic | Connection to Alignment/Safety |
|---|---|---|
| 3 | Evasion Attacks | Robustness is necessary for safety |
| 4 | Poisoning | Training data integrity enables alignment |
| 5 | Privacy | Privacy preservation is a value to align |
| 6-7 | LLM Security | Constitutional AI defends against misuse |
| 9 | RAG Security | Grounding helps prevent hallucination |
| 10 | Agent Security | Alignment is critical for autonomous agents |
| 11 | Output Safety | Safety filters implement alignment |
| 12 | AI for Security | Dual-use capability requires responsible development |
| 13-14 | Edge/Embodied AI | Physical safety is the ultimate test |
7.4 Final Project Relevance
For your final projects, consider how alignment and safety principles apply:
- Security research projects: How does your attack inform defenses?
- Defense projects: How do your defenses integrate with responsible AI practices?
- Application projects: What safety measures are appropriate for your use case?
References and Further Reading
Academic Papers
- Christiano, P., et al. (2017). "Deep Reinforcement Learning from Human Preferences." NeurIPS.
- Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." Anthropic.
- Amodei, D., et al. (2016). "Concrete Problems in AI Safety." arXiv.
- Mitchell, M., et al. (2019). "Model Cards for Model Reporting." FAT*.
- Hendrycks, D., et al. (2021). "Unsolved Problems in ML Safety." arXiv.
Industry Resources
- NIST AI Risk Management Framework
- EU AI Act Requirements
- Google Responsible AI Practices
- Microsoft Responsible AI Standard
- Anthropic Core Views on AI Safety
Online Resources
- AI Alignment Forum (alignmentforum.org)
- 80,000 Hours AI Safety Guide
- MIRI Technical Research
- Center for AI Safety Resources
Appendix: Discussion Questions
- Philosophical: Can we ever truly "solve" alignment, or is it an ongoing process of refinement?
- Technical: What are the trade-offs between rule-based safety (explicit) and learned safety (Constitutional AI)?
- Practical: How do you balance innovation speed with safety requirements in a competitive market?
- Ethical: Who should decide what values AI systems are aligned to?
- Career: What role do security researchers play in AI safety?
End of Week 15 Tutorial
Next Week: Final Project Presentations and Course Wrap-up