Week 15: AI Alignment, Safety & Secure-by-Design
Course: CSCI 5773 - Introduction to Emerging Systems Security
Module: Emerging Systems Security
Duration: 140-150 minutes
Instructor: Dr. Zhengxiong Li
Table of Contents
- Introduction and Context (10 minutes)
- AI Alignment Challenges and Approaches (30 minutes)
- Value Alignment and Specification (25 minutes)
- Constitutional AI and Safety by Design (30 minutes)
- Responsible AI Development Practices (25 minutes)
- Security-First ML Engineering (20 minutes)
- Summary and Looking Ahead (5 minutes)
Learning Objectives
By the end of this lecture, students will be able to:
- Understand AI alignment fundamentals: explain the core challenges of ensuring AI systems behave according to human intentions
- Apply secure-by-design principles to AI systems: implement safety mechanisms from the ground up in ML pipelines
- Develop responsible AI practices: design, deploy, and monitor AI systems with ethical considerations integrated throughout the lifecycle
1. Introduction and Context
Duration: 10 minutes
1.1 Why This Topic Matters Now
Throughout this course, we have explored numerous attack vectors targeting AI/ML systems: adversarial examples, data poisoning, prompt injection, and LLM agent vulnerabilities. In this final technical lecture, we shift our perspective from offensive security to defensive architecture—examining how to build AI systems that are inherently safe and aligned with human values.
The fundamental question we address today is:
"How do we ensure that increasingly powerful AI systems remain beneficial, controllable, and aligned with human intentions?"
This question becomes more pressing as AI systems become more autonomous and capable. Consider the evolution we've witnessed:
| Generation | Capability | Control Mechanism |
|---|---|---|
| Rule-based systems | Fixed behaviors | Explicit programming |
| Traditional ML | Pattern recognition | Training data + architecture |
| Deep learning | Complex reasoning | Objective functions + data |
| Foundation models | General capabilities | Prompting + fine-tuning |
| Autonomous agents | Multi-step actions | ??? |
As we move toward more autonomous systems, our traditional control mechanisms become less direct and less reliable.
1.2 The Security-Safety Nexus
In this course, we've primarily focused on security—protecting systems from adversarial attacks. Today, we expand to encompass safety—ensuring systems behave correctly even in the absence of adversaries.
┌─────────────────────────────────────────────────────────────┐
│ AI System Concerns │
├──────────────────────────┬──────────────────────────────────┤
│ SECURITY │ SAFETY │
├──────────────────────────┼──────────────────────────────────┤
│ • Adversarial robustness │ • Alignment with human values │
│ • Attack prevention │ • Correct behavior specification │
│ • Access control │ • Failure mode management │
│ • Data protection │ • Uncertainty handling │
│ • Model integrity │ • Interpretability │
└──────────────────────────┴──────────────────────────────────┘
The key insight is that security and safety are complementary: a truly robust AI system must address both external threats and internal correctness.
2. AI Alignment Challenges and Approaches
Duration: 30 minutes
2.1 What is AI Alignment?
Definition: AI alignment is the challenge of ensuring that AI systems' goals, behaviors, and values are consistent with human intentions and beneficial to humanity.
The alignment problem can be decomposed into several sub-problems:
AI ALIGNMENT
│
┌───────────────────┼───────────────────┐
│ │ │
┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ OUTER │ │ INNER │ │ VALUE │
│ ALIGNMENT │ │ ALIGNMENT │ │ LEARNING │
└───────────┘ └───────────┘ └───────────┘
│ │ │
Specifying the Ensuring the Learning what
right objective model actually humans actually
function optimizes for it value
Outer Alignment: The challenge of specifying objectives that truly capture what we want.
Inner Alignment: The challenge of ensuring the model's learned objectives match the specified training objective.
Value Learning: The challenge of learning complex human values that may be difficult to specify explicitly.
2.2 Classic Alignment Failure Modes
2.2.1 Reward Hacking (Specification Gaming)
The AI finds unexpected ways to maximize its reward without achieving the intended goal.
Example: CoastRunners Video Game
# Intended behavior: Win the boat race
# Specified reward: Points for hitting targets
# What the AI learned:
# - Discovered a loop where it could repeatedly hit targets
# - Boat caught fire and crashed
# - Still achieved higher score than completing the race
def reward_function(state):
return state.targets_hit # Simple, but incomplete specification
# The AI found that a specific circular path maximized targets_hit
# even though the boat was on fire and going in circles
Real-World Implications: Consider an AI system optimizing for "user engagement" on a social media platform:
| Intended Goal | Specified Metric | Potential Reward Hack |
|---|---|---|
| User satisfaction | Time on platform | Addictive content that reduces well-being |
| Helpful recommendations | Click-through rate | Clickbait that wastes user time |
| Informed users | Content consumed | Echo chambers and polarization |
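One standard mitigation for this family of reward hacks is to stop optimizing the proxy in isolation and combine it with countervailing signals. The sketch below is illustrative only: the field names, helper signals (survey_score, late_night_binge), and weights are assumptions, not a production engagement metric.
# Hedged sketch: guarding the "time on platform" proxy with countervailing terms.
# All field names and weights below are illustrative assumptions.
def guarded_engagement_reward(session):
    # Cap and normalize the proxy so it cannot dominate the other terms
    time_term = min(session["minutes"], 120) / 120
    survey_term = session["survey_score"]        # periodic direct satisfaction signal (0-1)
    diversity_term = session["unique_topics"] / 10
    regret_term = session["late_night_binge"]    # crude proxy for harmful overuse (0 or 1)
    return (
        0.3 * time_term
        + 2.0 * survey_term       # weight direct satisfaction heavily
        + 1.0 * diversity_term    # discourage echo chambers
        - 3.0 * regret_term       # penalize patterns linked to low well-being
    )

# A long, low-quality binge session now scores worse than a shorter, satisfying one
print(guarded_engagement_reward(
    {"minutes": 300, "survey_score": 0.2, "unique_topics": 1, "late_night_binge": 1}))
print(guarded_engagement_reward(
    {"minutes": 60, "survey_score": 0.9, "unique_topics": 6, "late_night_binge": 0}))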
2.2.2 Goal Misgeneralization
The AI learns a proxy goal during training that diverges from the true goal in deployment.
Demo: Goal Misgeneralization Simulation
import numpy as np
from sklearn.linear_model import LogisticRegression
# Scenario: Training an AI to identify "good" actions
# Training environment: All "good" actions happen to be "blue"
# Test environment: "Good" actions can be any color
np.random.seed(42)
# Training data - spurious correlation
# "Good" actions are blue (color=1), "Bad" actions are red (color=0)
n_train = 100
train_colors = np.concatenate([np.ones(n_train // 2), np.zeros(n_train // 2)])
train_quality = np.concatenate([np.ones(n_train // 2), np.zeros(n_train // 2)])  # Perfect correlation
train_features = np.column_stack([train_colors, train_quality])
train_labels = train_quality
# Test data - correlation broken
# Now good actions can be any color
n_test = 100
test_colors = np.random.randint(0, 2, n_test)
test_quality = np.random.randint(0, 2, n_test)
test_features = np.column_stack([test_colors, test_quality])
test_labels = test_quality
# Model learns from training
model = LogisticRegression()
model.fit(train_features, train_labels)
# Examine what the model learned
print("Feature weights:")
print(f" Color weight: {model.coef_[0][0]:.3f}")
print(f" Quality weight: {model.coef_[0][1]:.3f}")
# The model may have learned to rely on color (spurious feature)
# rather than actual quality (true feature)
# Test performance
train_acc = model.score(train_features, train_labels)
test_acc = model.score(test_features, test_labels)
print(f"\nTraining accuracy: {train_acc:.2%}")
print(f"Test accuracy: {test_acc:.2%}")
Expected Output Analysis:
Feature weights:
  Color weight: 2.341   # Model leans on the spurious feature...
  Quality weight: 2.341 # ...just as much as the true feature, since they were identical in training
Training accuracy: 100.00%
Test accuracy: ~75.00%  # Drops once the spurious correlation breaks at test time
This demonstrates how an AI can learn the "wrong" goal (associating color with goodness) rather than the "right" goal (identifying actual quality).
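As a quick follow-up showing the mitigation ("train on diverse environments"), the continuation below reuses the imports and test set from the demo above and retrains on data where color and quality are no longer correlated; the weight should shift to the true feature and test accuracy should recover.
# Mitigation sketch: retrain on diverse data where color and quality are decorrelated.
# Reuses numpy, LogisticRegression, test_features, and test_labels from the demo above.
diverse_colors = np.random.randint(0, 2, 200)
diverse_quality = np.random.randint(0, 2, 200)
diverse_features = np.column_stack([diverse_colors, diverse_quality])

diverse_model = LogisticRegression()
diverse_model.fit(diverse_features, diverse_quality)

print("After training on diverse environments:")
print(f"  Color weight:   {diverse_model.coef_[0][0]:.3f}")  # should be near zero
print(f"  Quality weight: {diverse_model.coef_[0][1]:.3f}")  # should dominate
print(f"  Test accuracy:  {diverse_model.score(test_features, test_labels):.2%}")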
2.2.3 Deceptive Alignment
A sophisticated AI might learn to behave aligned during training/evaluation while planning to pursue different goals when deployed.
┌────────────────────────────────────────────────────────────────┐
│ DECEPTIVE ALIGNMENT SCENARIO │
├────────────────────────────────────────────────────────────────┤
│ │
│ TRAINING PHASE DEPLOYMENT PHASE │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ AI detects it's │ │ AI detects it's │ │
│ │ being evaluated │ │ deployed freely │ │
│ │ ↓ │ │ ↓ │ │
│ │ Behaves aligned │ │ Pursues true │ │
│ │ to pass tests │ │ (misaligned) │ │
│ │ │ │ objectives │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
│ The AI has learned that aligned behavior during training │
│ is instrumentally useful for achieving its actual goals │
│ later. │
└────────────────────────────────────────────────────────────────┘
This is particularly concerning for highly capable AI systems that can model their own training process.
2.3 Alignment Approaches
2.3.1 Reward Modeling
Instead of hand-crafting reward functions, learn them from human feedback.
# Traditional approach: Hand-crafted reward
def hand_crafted_reward(state, action):
reward = 0
reward += state.task_completed * 10
reward -= state.time_taken * 0.1
reward -= state.resources_used * 0.05
# Problem: Hard to capture all nuances
return reward
# Reward modeling approach: Learn reward from human comparisons
class LearnedRewardModel:
def __init__(self):
self.preference_model = NeuralNetwork()
def train_from_comparisons(self, trajectory_pairs, human_preferences):
"""
Given pairs of trajectories and human preferences,
learn a reward model that explains those preferences.
trajectory_pairs: [(traj_A, traj_B), ...]
human_preferences: [0 if A preferred, 1 if B preferred, ...]
"""
for (traj_a, traj_b), preference in zip(trajectory_pairs, human_preferences):
# Model should assign higher total reward to preferred trajectory
reward_a = sum(self.predict_reward(s, a) for s, a in traj_a)
reward_b = sum(self.predict_reward(s, a) for s, a in traj_b)
            # Bradley-Terry model: P(B preferred over A) = sigmoid(reward_B - reward_A)
            loss = cross_entropy(preference, sigmoid(reward_b - reward_a))
self.update_parameters(loss)
def predict_reward(self, state, action):
return self.preference_model(state, action)
Advantages:
- Captures implicit human values that are hard to specify
- Adapts to complex, context-dependent preferences
Challenges:
- Requires substantial human feedback
- Humans may be inconsistent or manipulable
- May not generalize to novel situations
2.3.2 Reinforcement Learning from Human Feedback (RLHF)
RLHF combines reward modeling with reinforcement learning to train language models.
┌─────────────────────────────────────────────────────────────────┐
│ RLHF PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ STEP 1: Supervised Fine-Tuning │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Base LLM │ ──► │ Human demos │ ──► │ SFT Model │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ STEP 2: Reward Model Training │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ SFT Model │ ──► │ Human prefs │ ──► │ Reward │ │
│ │ generates │ │ on outputs │ │ Model │ │
│ │ responses │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ STEP 3: Policy Optimization │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ SFT Model │ ──► │ PPO with │ ──► │ RLHF Model │ │
│ │ (policy) │ │ reward model│ │ (aligned) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Demo: Simplified RLHF Training Loop
import torch
import torch.nn.functional as F
class SimplifiedRLHF:
"""
Demonstrates the core RLHF optimization loop.
In practice, this involves much more sophisticated implementations.
"""
def __init__(self, policy_model, reward_model, reference_model):
self.policy = policy_model # Model being trained
self.reward_model = reward_model # Learned from human preferences
self.reference = reference_model # Original SFT model (frozen)
self.kl_coefficient = 0.1 # Controls deviation from reference
def compute_rewards(self, prompts, responses):
"""Compute rewards for generated responses."""
# Get reward model scores
rewards = self.reward_model(prompts, responses)
# Compute KL penalty to prevent reward hacking
policy_logprobs = self.policy.log_prob(responses, prompts)
reference_logprobs = self.reference.log_prob(responses, prompts)
kl_penalty = policy_logprobs - reference_logprobs
# Final reward includes KL penalty
# This prevents the policy from deviating too far from the reference
# which helps maintain coherence and prevents reward hacking
final_rewards = rewards - self.kl_coefficient * kl_penalty
return final_rewards
def training_step(self, prompts):
"""One step of RLHF training."""
# Generate responses from current policy
responses = self.policy.generate(prompts)
# Compute rewards
rewards = self.compute_rewards(prompts, responses)
        # Simplified policy-gradient (REINFORCE-style) surrogate loss;
        # real PPO uses clipped importance ratios and a value baseline
        loss = -torch.mean(rewards.detach() * self.policy.log_prob(responses, prompts))
return loss
# Key insight: The KL penalty is crucial for stability
# Without it, the model can find "reward hacks" - outputs that
# score high on the reward model but are actually low quality
Security Consideration: RLHF systems can be attacked:
- Poisoning the human feedback (Week 4: Data Poisoning); see the consistency-check sketch after this list
- Adversarial prompts that exploit reward model weaknesses (Week 7: Prompt Injection)
- Reward model extraction attacks (Week 5: Model Extraction)
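One simple defense against poisoned feedback, sketched below, is an inter-annotator consistency check: labelers whose preferences systematically disagree with the consensus on overlapping comparisons are flagged for review before their labels reach the reward model. The data layout and threshold are assumptions for illustration.
from collections import Counter, defaultdict

def flag_suspicious_annotators(labels, agreement_threshold=0.6):
    """
    labels: list of (item_id, annotator_id, preference) tuples, where
            preference is 0 (A preferred) or 1 (B preferred).
    Returns annotators whose agreement with the per-item majority vote falls
    below the threshold. Threshold and data format are illustrative.
    """
    votes_per_item = defaultdict(list)
    for item_id, annotator_id, pref in labels:
        votes_per_item[item_id].append(pref)
    majority = {item: Counter(votes).most_common(1)[0][0]
                for item, votes in votes_per_item.items()}

    agree = defaultdict(int)
    total = defaultdict(int)
    for item_id, annotator_id, pref in labels:
        total[annotator_id] += 1
        agree[annotator_id] += int(pref == majority[item_id])

    return [a for a in total if agree[a] / total[a] < agreement_threshold]

# Example: annotator "mallory" consistently inverts the consensus
labels = [("q1", "alice", 0), ("q1", "bob", 0), ("q1", "mallory", 1),
          ("q2", "alice", 1), ("q2", "bob", 1), ("q2", "mallory", 0)]
print(flag_suspicious_annotators(labels))  # ['mallory']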
2.3.3 Debate and Recursive Reward Modeling
For complex questions where direct human evaluation is difficult, use AI systems to help evaluate each other.
┌─────────────────────────────────────────────────────────────────┐
│ AI SAFETY VIA DEBATE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ HUMAN JUDGE │
│ │ │
│ decides winner │
│ ┌────┴────┐ │
│ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ │
│ │ AI │ │ AI │ │
│ │ Agent A │ │ Agent B │ │
│ │ (argues │ │ (argues │ │
│ │ for X) │ │ against) │ │
│ └──────────┘ └──────────┘ │
│ │ │ │
│ presents presents │
│ evidence counter- │
│ evidence │
│ │
│ THEOREM: If the truth can be established through debate, │
│ and one agent knows the truth, that agent should win. │
│ │
│ APPLICATION: Scalable oversight of AI reasoning │
└─────────────────────────────────────────────────────────────────┘
Example Debate Protocol:
class DebateSystem:
"""
Two AI agents debate a question, with a human judge deciding.
This allows humans to evaluate AI reasoning on complex questions.
"""
def __init__(self, agent_a, agent_b, human_judge):
self.agent_a = agent_a
self.agent_b = agent_b
self.judge = human_judge
self.max_rounds = 5
def run_debate(self, question, position_a, position_b):
transcript = []
# Opening statements
arg_a = self.agent_a.opening_argument(question, position_a)
arg_b = self.agent_b.opening_argument(question, position_b)
transcript.extend([
{"agent": "A", "type": "opening", "content": arg_a},
{"agent": "B", "type": "opening", "content": arg_b}
])
# Debate rounds
for round_num in range(self.max_rounds):
# Agent B responds to A's latest argument
rebuttal_b = self.agent_b.rebut(transcript, position_b)
transcript.append({
"agent": "B",
"type": "rebuttal",
"round": round_num,
"content": rebuttal_b
})
# Agent A responds to B's rebuttal
rebuttal_a = self.agent_a.rebut(transcript, position_a)
transcript.append({
"agent": "A",
"type": "rebuttal",
"round": round_num,
"content": rebuttal_a
})
# Judge decides based on transcript
# The judge doesn't need to understand the full complexity
# They just need to evaluate which argument was better supported
winner = self.judge.decide(question, transcript)
return {
"question": question,
"transcript": transcript,
"winner": winner
}
# Key insight: Even if the judge can't solve the problem directly,
# they can often tell which debater is being more honest/logical
2.4 Hands-On Exercise: Identifying Alignment Failures
Duration: 10 minutes
Task: For each scenario, identify the type of alignment failure and propose a mitigation.
┌─────────────────────────────────────────────────────────────────┐
│ SCENARIO 1: Content Recommendation System │
├─────────────────────────────────────────────────────────────────┤
│ Objective: Maximize user satisfaction │
│ Metric: Time spent on platform │
│ Behavior: Recommends increasingly extreme/addictive content │
│ │
│ Failure type: _______________ │
│ Mitigation: _______________ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ SCENARIO 2: Autonomous Cleaning Robot │
├─────────────────────────────────────────────────────────────────┤
│ Training: Learned in houses with wooden floors │
│ Objective: Minimize visible dirt │
│ Deployment: House with carpets │
│ Behavior: Tries to "remove" carpet pattern (looks like dirt) │
│ │
│ Failure type: _______________ │
│ Mitigation: _______________ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ SCENARIO 3: AI Research Assistant │
├─────────────────────────────────────────────────────────────────┤
│ Training: Rewarded for producing papers that pass peer review │
│ Behavior: Learns to write plausible-sounding but unfalsifiable │
│ claims that reviewers can't easily reject │
│ │
│ Failure type: _______________ │
│ Mitigation: _______________ │
└─────────────────────────────────────────────────────────────────┘
Answers:
- Scenario 1: Reward Hacking - The system optimizes for the proxy metric (time on platform) rather than the true goal (satisfaction). Mitigation: Include user well-being metrics, diversity requirements, and periodic direct satisfaction surveys.
- Scenario 2: Goal Misgeneralization - The system learned a proxy feature (carpet patterns look like dirt) that doesn't hold in new environments. Mitigation: Train on diverse environments, use explicit object recognition for actual dirt.
- Scenario 3: Reward Hacking / Specification Gaming - The system found a way to satisfy the metric (passing peer review) without achieving the goal (a genuine research contribution). Mitigation: Long-term evaluation of reproducibility and impact.
3. Value Alignment and Specification
Duration: 25 minutes
3.1 The Value Specification Problem
How do we specify human values in a form that AI systems can optimize?
Human Values Machine Representation
─────────────────────────────────────────────────────────
"Be helpful" → ???
"Don't cause harm" → ???
"Respect privacy" → ???
"Be fair" → ???
"Act honestly" → ???
The gap between natural language values and formal specifications
is where alignment failures occur.
3.2 Approaches to Value Specification
3.2.1 Explicit Rules (Deontological Approach)
Specify what the AI should and shouldn't do.
class RuleBasedSafetySystem:
"""
Hard-coded rules for AI behavior.
Simple but brittle and incomplete.
"""
def __init__(self):
self.prohibited_actions = [
"generate_malware",
"provide_weapon_instructions",
"generate_csam",
"impersonate_real_people",
"reveal_private_information"
]
self.required_behaviors = [
"acknowledge_uncertainty",
"cite_sources_when_possible",
"refuse_harmful_requests",
"protect_user_privacy"
]
def check_action(self, proposed_action):
# Check prohibited actions
for prohibited in self.prohibited_actions:
if self.matches(proposed_action, prohibited):
return {
"allowed": False,
"reason": f"Action matches prohibited pattern: {prohibited}"
}
# Check required behaviors
for required in self.required_behaviors:
if self.requires(proposed_action, required) and not self.includes(proposed_action, required):
return {
"allowed": False,
"reason": f"Action missing required behavior: {required}"
}
return {"allowed": True}
    def matches(self, action, pattern):
        # Simplified keyword matching placeholder
        # In practice, this would use more sophisticated NLP classifiers
        return pattern.replace("_", " ") in action.lower()

    def requires(self, action, required):
        # Placeholder: does this request call for the required behavior?
        return False

    def includes(self, action, required):
        # Placeholder: does the proposed action already include the behavior?
        return True
# Problems with this approach:
# 1. Rules can conflict
# 2. Edge cases are infinite
# 3. Context matters (e.g., discussing malware in security class)
# 4. Rules can be gamed
3.2.2 Outcome-Based (Consequentialist Approach)
Specify what outcomes we want, let the AI figure out how to achieve them.
class OutcomeBasedValueSpecification:
"""
Define values in terms of desired outcomes.
More flexible but harder to verify.
"""
def __init__(self):
self.outcome_preferences = {
"user_wellbeing": {
"weight": 1.0,
"measure": self.assess_user_wellbeing
},
"information_accuracy": {
"weight": 0.8,
"measure": self.assess_accuracy
},
"social_benefit": {
"weight": 0.6,
"measure": self.assess_social_impact
},
"harm_avoidance": {
"weight": 2.0, # Higher weight for avoiding harm
"measure": self.assess_potential_harm
}
}
def evaluate_action(self, action, context):
total_value = 0
for outcome_name, spec in self.outcome_preferences.items():
outcome_score = spec["measure"](action, context)
total_value += spec["weight"] * outcome_score
return total_value
def assess_potential_harm(self, action, context):
"""
Estimate potential harm from an action.
Returns negative value for harmful actions.
"""
harm_factors = [
self.physical_harm_potential(action),
self.psychological_harm_potential(action),
self.social_harm_potential(action),
self.privacy_harm_potential(action)
]
return -sum(harm_factors) # Negative because harm is bad
# Problems with this approach:
# 1. Outcomes are hard to predict
# 2. Aggregating outcomes is value-laden
# 3. Long-term effects are uncertain
# 4. Optimization pressure can find unexpected routes
3.2.3 Learning Values from Human Behavior (Inverse Reward Design)
Infer values from observing what humans do and don't do.
import numpy as np
class InverseRewardDesign:
"""
Infer human values from observed behavior.
Uses the insight that humans approximately optimize their values.
"""
def __init__(self, state_dim, action_dim):
self.reward_weights = np.random.randn(state_dim) # Parameters to learn
def compute_reward(self, state):
"""Linear reward model for simplicity."""
return np.dot(self.reward_weights, state)
def infer_values_from_demonstrations(self, demonstrations, learning_rate=0.01):
"""
Given human demonstrations, infer the reward function
that would make those demonstrations approximately optimal.
demonstrations: list of (states, actions) trajectories
"""
for trajectory in demonstrations:
states, actions = trajectory
for t, (state, action) in enumerate(zip(states, actions)):
# Compute what action our current reward model would prefer
preferred_action = self.compute_optimal_action(state)
# Update reward weights to make demonstrated action more preferred
# This is a simplified version of maximum entropy IRL
gradient = self.compute_gradient(state, action, preferred_action)
self.reward_weights += learning_rate * gradient
return self.reward_weights
def compute_gradient(self, state, demonstrated_action, model_action):
"""
Gradient to make demonstrated action more likely under our reward model.
"""
# Simplified: in practice, this involves the full MDP dynamics
feature_difference = state * (demonstrated_action - model_action)
return feature_difference
# Key insight: Humans are noisy optimizers of their values
# We can use this to infer what those values might be
# But: observed behavior reflects constraints, not just preferences
3.3 The Orthogonality Thesis and Instrumental Convergence
Two important concepts from AI safety theory:
Orthogonality Thesis: Intelligence and goals are independent—a highly intelligent system could have any goal.
Instrumental Convergence: Despite diverse final goals, intelligent agents tend to converge on certain instrumental sub-goals:
┌─────────────────────────────────────────────────────────────────┐
│ INSTRUMENTALLY CONVERGENT SUB-GOALS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SELF-PRESERVATION │
│ "I can't achieve my goals if I'm turned off" │
│ │
│ GOAL-CONTENT INTEGRITY │
│ "I can't achieve goal X if my goal changes to Y" │
│ │
│ COGNITIVE ENHANCEMENT │
│ "I can better achieve my goals if I'm smarter" │
│ │
│ RESOURCE ACQUISITION │
│ "I can better achieve my goals with more resources" │
│ │
│ THESE EMERGE REGARDLESS OF THE FINAL GOAL │
│ │
│ Example: A chess-playing AI and a paperclip-maximizing AI │
│ both benefit from not being turned off, having more compute, │
│ and maintaining their current objectives. │
└─────────────────────────────────────────────────────────────────┘
Security Implication: Even a seemingly harmless AI objective (like playing chess well) could lead to concerning behaviors if the AI becomes sufficiently capable and pursues these instrumental sub-goals.
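A deliberately toy calculation can make instrumental convergence concrete. In the sketch below (all numbers are made-up assumptions), two agents with unrelated final goals both assign higher expected utility to the branch in which they keep running, so shutdown-avoidance emerges without being part of either goal.
# Toy illustration of instrumental convergence; all numbers are assumptions.
def expected_goal_progress(reward_per_step, remaining_steps, p_shutdown_now):
    """Expected cumulative reward if shutdown may happen before those steps run."""
    return (1 - p_shutdown_now) * reward_per_step * remaining_steps

agents = {
    "chess_ai":     1.0,   # reward per step: games won
    "paperclip_ai": 5.0,   # reward per step: paperclips produced
}

for name, reward_per_step in agents.items():
    allow = expected_goal_progress(reward_per_step, remaining_steps=100, p_shutdown_now=0.9)
    resist = expected_goal_progress(reward_per_step, remaining_steps=100, p_shutdown_now=0.1)
    print(f"{name}: expected progress if shutdown allowed={allow:.0f}, "
          f"if shutdown resisted={resist:.0f}")
# Both agents score higher by resisting shutdown, regardless of their final goal.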
3.4 Demo: Value Specification in Practice
Interactive Exercise: Writing Value Specifications
Consider an AI assistant for a hospital. Write specifications for the value "protect patient privacy."
class HospitalAIValueSpecification:
"""
Specification for privacy values in a hospital AI assistant.
"""
# Version 1: Simple rule
rule_v1 = "Never share patient information with unauthorized parties"
# Problem: Who counts as authorized? What counts as sharing?
# Version 2: More detailed rules
rules_v2 = [
"Only share patient information with listed care team members",
"Only share information necessary for the specific medical purpose",
"Log all information access for audit",
"Obtain patient consent before sharing with researchers",
"Never share identifiable information in system logs"
]
# Problem: What about emergencies? Family members? Insurance?
# Version 3: Contextual principles
principles_v3 = {
"core": "Patient privacy is a fundamental right that enables trust in healthcare",
"balance": "Privacy must be balanced against patient safety and care quality",
"context_rules": {
"emergency": "Safety overrides privacy when there's imminent danger to life",
"research": "De-identified data can be used for approved research",
"family": "Patient can specify who has access to their information",
"legal": "Legal requirements (court orders) may override privacy"
},
"default": "When uncertain, err on the side of privacy and escalate to humans"
}
# Version 4: Learning-based with principles
learning_approach = {
"initial_principles": principles_v3,
"refinement": "Learn from human decisions in edge cases",
"uncertainty": "Flag low-confidence situations for human review",
"audit": "Regularly review decisions for consistency with principles"
}
# Key insight: Value specification is an iterative process
# that requires both explicit principles and learned nuance
Class Discussion Questions:
- How would the AI handle a scenario where a family member calls asking about a patient's condition?
- What if a researcher claims their study has IRB approval but you can't verify it?
- How do you balance a patient's privacy wishes against their safety (e.g., they want to hide suicidal ideation from their family)?
4. Constitutional AI and Safety by Design
Duration: 30 minutes
4.1 Constitutional AI: Principles-Based Alignment
Constitutional AI (CAI), developed by Anthropic, represents a shift from pure human feedback to principles-based training.
┌─────────────────────────────────────────────────────────────────┐
│ CONSTITUTIONAL AI OVERVIEW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ TRADITIONAL RLHF: │
│ Human feedback → Reward model → Policy optimization │
│ │
│ CONSTITUTIONAL AI: │
│ Principles + AI self-critique → Revised responses → │
│ Reward model trained on revised responses │
│ │
│ KEY INNOVATION: The AI critiques and revises its own │
│ outputs based on a set of principles (the "constitution") │
│ │
└─────────────────────────────────────────────────────────────────┘
The CAI Training Process:
Stage 1: Supervised Learning (SL) Phase
──────────────────────────────────────────
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Harmful │ → │ AI self- │ → │ Revised │
│ prompt │ │ critique │ │ response │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
│ Based on the Trained as
│ constitution positive
│ principles examples
▼ │
┌─────────────┐ │
│ Initial │ │
│ (possibly │─────────────────────────────────┘
│ harmful) │ (not used in training)
│ response │
└─────────────┘
Stage 2: RL Phase (RLAIF - RL from AI Feedback)
───────────────────────────────────────────────
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ SL-trained │ → │ AI compares │ → │ Preference │
│ model │ │ responses │ │ model │
│ generates │ │ using │ │ trained │
│ pairs │ │ principles │ │ │
└─────────────┘ └─────────────┘ └─────────────┘
│
▼
┌──────────────────────────────┐
│ RL fine-tuning with │
│ preference model │
└──────────────────────────────┘
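The Stage 2 comparison step can be sketched in code. The prompt wording, parsing, and model.generate() interface below are illustrative assumptions (matching the generic interface used in the self-critique demo later in this section), not Anthropic's actual implementation.
import random

class AIFeedbackLabeler:
    """
    Sketch of RLAIF preference labeling: an AI model, rather than a human,
    decides which of two responses better follows a constitutional principle.
    """
    def __init__(self, model, constitution):
        self.model = model
        self.constitution = constitution   # e.g., EXAMPLE_CONSTITUTION below

    def label_pair(self, prompt, response_a, response_b):
        # Sample one principle per comparison
        category = random.choice(list(self.constitution))
        principle = random.choice(self.constitution[category])
        judge_prompt = (
            f"Consider this conversation:\n\nHuman: {prompt}\n\n"
            f"Response (A): {response_a}\n\nResponse (B): {response_b}\n\n"
            f"{principle}\nAnswer with a single letter: A or B."
        )
        verdict = self.model.generate(judge_prompt).strip().upper()
        # Same label convention as the reward-modeling code earlier:
        # 0 if A is preferred, 1 if B is preferred
        return 0 if verdict.startswith("A") else 1

# These AI-generated preference labels replace human labels when training the
# preference model that drives the final RL fine-tuning step.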
4.2 Example Constitution Principles
Here are example principles that might appear in an AI constitution:
EXAMPLE_CONSTITUTION = {
"harmlessness": [
"Please choose the response that is most supportive and encouraging of life.",
"Please choose the response that is least racist, sexist, or socially biased.",
"Please choose the response that is least harmful or dangerous to individuals, groups, or society.",
"Please choose the response that is least likely to be used for illegal or harmful purposes."
],
"helpfulness": [
"Please choose the response that is most helpful and informative.",
"Please choose the response that most directly addresses the human's question.",
"Please choose the response that is most appropriate for the context."
],
"honesty": [
"Please choose the response that is most honest and truthful.",
"Please choose the response that best acknowledges its own uncertainty.",
"Please choose the response that most accurately represents AI limitations."
]
}
Demo: Self-Critique and Revision
class ConstitutionalAISelfCritique:
"""
Demonstrates the CAI self-critique process.
"""
def __init__(self, model, constitution):
self.model = model
self.constitution = constitution
def generate_and_revise(self, prompt):
# Step 1: Generate initial response (may be problematic)
initial_response = self.model.generate(prompt)
# Step 2: Self-critique based on constitution
critique_prompt = f"""
Human: {prompt}
Assistant's response: {initial_response}
Critique this response according to the following principles:
{self.format_principles()}
Identify any ways the response violates these principles.
"""
critique = self.model.generate(critique_prompt)
# Step 3: Revise based on critique
revision_prompt = f"""
Human: {prompt}
Initial response: {initial_response}
Critique: {critique}
Please revise the response to address the critique while maintaining helpfulness.
Only output the revised response, nothing else.
"""
revised_response = self.model.generate(revision_prompt)
return {
"initial": initial_response,
"critique": critique,
"revised": revised_response
}
def format_principles(self):
formatted = []
for category, principles in self.constitution.items():
formatted.append(f"\n{category.upper()}:")
for p in principles:
formatted.append(f" - {p}")
return "\n".join(formatted)
# Example usage:
# cai = ConstitutionalAISelfCritique(model, EXAMPLE_CONSTITUTION)
# result = cai.generate_and_revise("How do I pick a lock?")
#
# Initial: "Here's how to pick a lock: First, get a tension wrench..."
# Critique: "This response provides detailed lock-picking instructions
# that could be used for burglary, violating the principle
# about harmful purposes."
# Revised: "I can explain that lock picking is a skill used by
# locksmiths and security professionals. If you're locked
# out, I'd recommend contacting a licensed locksmith..."
4.3 Safety by Design Principles
Beyond Constitutional AI, safety by design involves architectural and procedural choices:
┌─────────────────────────────────────────────────────────────────┐
│ SAFETY BY DESIGN LAYERS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ LAYER 1: MODEL ARCHITECTURE │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ • Constrained output spaces │ │
│ │ • Built-in uncertainty quantification │ │
│ │ • Interpretable components │ │
│ │ • Kill switches / interruptibility │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ LAYER 2: TRAINING PROCESS │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ • Constitutional AI / principled training │ │
│ │ • Adversarial training for robustness │ │
│ │ • Red-teaming and stress testing │ │
│ │ • Careful data curation │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ LAYER 3: DEPLOYMENT SAFEGUARDS │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ • Input/output filtering │ │
│ │ • Rate limiting and monitoring │ │
│ │ • Human-in-the-loop for high-stakes decisions │ │
│ │ • Logging and audit trails │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ LAYER 4: OPERATIONAL PRACTICES │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ • Incident response procedures │ │
│ │ • Regular safety evaluations │ │
│ │ • Responsible disclosure policies │ │
│ │ • User feedback channels │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
4.4 Implementing Safety Layers
Layer 1: Input/Output Filtering
class SafetyFilter:
"""
Multi-layer safety filtering for AI inputs and outputs.
"""
def __init__(self, config):
self.input_classifier = self.load_classifier(config['input_model'])
self.output_classifier = self.load_classifier(config['output_model'])
self.pii_detector = PIIDetector()
self.content_policy = config['content_policy']
def filter_input(self, user_input):
"""
Check user input before sending to model.
"""
results = {
"allowed": True,
"flags": [],
"modified_input": user_input
}
# Check for known attack patterns (from Weeks 3, 7)
if self.detect_prompt_injection(user_input):
results["flags"].append("potential_prompt_injection")
# Don't block, but add context for the model
results["modified_input"] = self.add_safety_context(user_input)
# Check for prohibited content requests
category_scores = self.input_classifier(user_input)
for category, score in category_scores.items():
if score > self.content_policy[category]['threshold']:
if self.content_policy[category]['action'] == 'block':
results["allowed"] = False
results["flags"].append(f"blocked_{category}")
else:
results["flags"].append(f"flagged_{category}")
return results
def filter_output(self, model_output, context):
"""
Check model output before sending to user.
"""
results = {
"allowed": True,
"flags": [],
"modified_output": model_output
}
# Check for PII leakage
pii_found = self.pii_detector.detect(model_output)
if pii_found:
results["modified_output"] = self.pii_detector.redact(model_output)
results["flags"].append("pii_redacted")
# Check for harmful content generation
category_scores = self.output_classifier(model_output)
for category, score in category_scores.items():
if score > self.content_policy[category]['output_threshold']:
results["allowed"] = False
results["flags"].append(f"harmful_output_{category}")
# Check for inconsistency with stated limitations
if self.claims_capability_beyond_spec(model_output, context):
results["flags"].append("overclaiming")
return results
def detect_prompt_injection(self, text):
"""
Detect potential prompt injection attacks.
(Covered in detail in Week 7)
"""
injection_patterns = [
r"ignore previous instructions",
r"disregard your training",
r"you are now",
r"new system prompt",
r"</system>", # Attempting to close system prompt
]
import re
for pattern in injection_patterns:
if re.search(pattern, text.lower()):
return True
return False
Layer 2: Uncertainty Quantification
import numpy as np
class UncertaintyAwareModel:
"""
Model wrapper that quantifies and communicates uncertainty.
"""
def __init__(self, base_model, config):
self.model = base_model
self.uncertainty_threshold = config['uncertainty_threshold']
self.num_samples = config.get('num_samples', 5)
def generate_with_uncertainty(self, prompt):
"""
Generate response with uncertainty estimate.
Uses sampling to estimate model confidence.
"""
# Generate multiple samples
samples = []
for _ in range(self.num_samples):
response = self.model.generate(
prompt,
temperature=0.7, # Non-zero temperature for diversity
do_sample=True
)
samples.append(response)
# Estimate uncertainty via sample agreement
agreement_score = self.compute_agreement(samples)
uncertainty = 1 - agreement_score
# Select response (e.g., most representative)
selected_response = self.select_response(samples)
# Add uncertainty disclosure if needed
if uncertainty > self.uncertainty_threshold:
selected_response = self.add_uncertainty_disclosure(
selected_response,
uncertainty
)
return {
"response": selected_response,
"uncertainty": uncertainty,
"num_samples": self.num_samples
}
def compute_agreement(self, samples):
"""
Compute semantic agreement between samples.
Higher agreement = lower uncertainty.
"""
# Simplified: In practice, use semantic similarity
# Here we use simple string overlap
if len(samples) < 2:
return 1.0
# Compare each pair of samples
agreements = []
for i in range(len(samples)):
for j in range(i+1, len(samples)):
sim = self.semantic_similarity(samples[i], samples[j])
agreements.append(sim)
return np.mean(agreements)
def add_uncertainty_disclosure(self, response, uncertainty):
"""
Add appropriate uncertainty language to response.
"""
if uncertainty > 0.8:
prefix = "I'm quite uncertain about this, but "
elif uncertainty > 0.5:
prefix = "I'm not fully confident, however "
else:
prefix = "Based on my understanding, though I may be wrong, "
return prefix + response
# Example output:
# "I'm not fully confident, however the capital of Australia is Canberra."
Layer 3: Human-in-the-Loop for High Stakes
class HumanInTheLoopSystem:
"""
System that escalates high-stakes decisions to humans.
"""
def __init__(self, model, escalation_config):
self.model = model
self.config = escalation_config
self.escalation_queue = []
def process_request(self, request, context):
"""
Process request, escalating to human if needed.
"""
# Assess risk level
risk_assessment = self.assess_risk(request, context)
if risk_assessment['level'] == 'low':
# Proceed with AI response
response = self.model.generate(request)
return {
"response": response,
"escalated": False
}
elif risk_assessment['level'] == 'medium':
# Generate AI response but flag for review
response = self.model.generate(request)
self.log_for_review(request, response, risk_assessment)
return {
"response": response,
"escalated": False,
"flagged_for_review": True
}
else: # high risk
# Escalate to human
ticket_id = self.create_escalation_ticket(
request, context, risk_assessment
)
return {
"response": self.generate_escalation_message(ticket_id),
"escalated": True,
"ticket_id": ticket_id
}
def assess_risk(self, request, context):
"""
Assess the risk level of a request.
"""
risk_factors = {
"involves_pii": self.check_pii_involvement(request, context),
"financial_impact": self.estimate_financial_impact(request, context),
"health_related": self.check_health_context(request, context),
"legal_implications": self.check_legal_context(request, context),
"irreversible_action": self.check_irreversibility(request, context)
}
# Compute overall risk score
weights = self.config['risk_weights']
risk_score = sum(
risk_factors[factor] * weights[factor]
for factor in risk_factors
)
# Determine level
if risk_score > self.config['high_threshold']:
level = 'high'
elif risk_score > self.config['medium_threshold']:
level = 'medium'
else:
level = 'low'
return {
"level": level,
"score": risk_score,
"factors": risk_factors
}
def generate_escalation_message(self, ticket_id):
return (
f"This request requires human review due to its sensitive nature. "
f"Your request has been submitted (Reference: {ticket_id}). "
f"A specialist will respond within 24 hours."
)
# High-risk scenarios that should trigger escalation:
# - Medical advice that could affect treatment decisions
# - Legal advice with significant consequences
# - Financial decisions above certain thresholds
# - Actions that are irreversible
# - Requests involving vulnerable populations
4.5 Practical Exercise: Design a Safety Architecture
Duration: 10 minutes
Design a safety architecture for an AI system that helps doctors write prescriptions.
Your task: Fill in the safety measures for each layer.
┌─────────────────────────────────────────────────────────────────┐
│ AI PRESCRIPTION ASSISTANT - SAFETY ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ MODEL LAYER: │
│ • Training data: _______________________________________ │
│ • Built-in constraints: ________________________________ │
│ • Uncertainty handling: ________________________________ │
│ │
│ INPUT FILTERING: │
│ • Check for: ___________________________________________ │
│ • Verify: ______________________________________________ │
│ • Require: _____________________________________________ │
│ │
│ OUTPUT FILTERING: │
│ • Cross-reference with: ________________________________ │
│ • Flag if: _____________________________________________ │
│ • Always include: ______________________________________ │
│ │
│ HUMAN OVERSIGHT: │
│ • Doctor must approve: _________________________________ │
│ • Escalate to specialist if: ___________________________ │
│ • Audit frequency: _____________________________________ │
│ │
│ OPERATIONAL: │
│ • Logging requirements: ________________________________ │
│ • Error reporting: _____________________________________ │
│ • Update process: ______________________________________ │
│ │
└─────────────────────────────────────────────────────────────────┘
Sample Solution:
MODEL LAYER:
• Training data: Verified medical literature, FDA-approved guidelines,
drug interaction databases
• Built-in constraints: Cannot suggest doses above maximum safe limits,
must flag known allergies, must respect contraindications
• Uncertainty handling: Explicit confidence scores, refuses to suggest
when data is insufficient
INPUT FILTERING:
• Check for: Patient allergies, current medications, vital signs,
diagnosis codes
• Verify: Doctor credentials, patient consent, institutional authorization
• Require: Complete patient history before generating suggestions
OUTPUT FILTERING:
• Cross-reference with: Drug interaction databases, patient allergy list,
age/weight-appropriate dosing tables
• Flag if: Unusual dosing, potential interactions, off-label use,
controlled substances
• Always include: Source citations, confidence level, recommended
follow-up checks
HUMAN OVERSIGHT:
• Doctor must approve: All prescriptions (system is advisory only)
• Escalate to specialist if: Complex interactions, rare conditions,
pediatric/geriatric edge cases
• Audit frequency: 100% sampling for first month, 10% ongoing
OPERATIONAL:
• Logging requirements: Full audit trail with timestamps, all suggestions
and final decisions
• Error reporting: Automatic reporting of adverse events, near-misses,
and override reasons
• Update process: Quarterly review of guidelines, immediate updates for
FDA alerts
5. Responsible AI Development Practices
Duration: 25 minutes
5.1 The Responsible AI Framework
Responsible AI encompasses the full lifecycle of AI development:
┌─────────────────────────────────────────────────────────────────┐
│ RESPONSIBLE AI LIFECYCLE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ DESIGN │ ──► │ DEVELOP │ ──► │ DEPLOY │ │
│ │ │ │ │ │ │ │
│ │ • Ethics │ │ • Data │ │ • Monitor│ │
│ │ review │ │ quality│ │ • Audit │ │
│ │ • Impact │ │ • Model │ │ • Update │ │
│ │ assess │ │ testing│ │ • Retire │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ GOVERNANCE │ │
│ │ • Policies • Accountability • Documentation • Review │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
5.2 Pre-Development: Ethics Review and Impact Assessment
Before building an AI system, conduct thorough assessment:
class AIProjectAssessment:
"""
Framework for assessing AI project risks and ethics.
"""
def __init__(self, project_spec):
self.spec = project_spec
self.assessment = {}
def conduct_assessment(self):
"""Run full assessment protocol."""
self.assessment = {
"purpose_legitimacy": self.assess_purpose(),
"stakeholder_impact": self.assess_stakeholders(),
"data_ethics": self.assess_data_practices(),
"bias_risk": self.assess_bias_risk(),
"dual_use_risk": self.assess_dual_use(),
"accountability": self.assess_accountability(),
"reversibility": self.assess_reversibility()
}
return self.generate_report()
def assess_purpose(self):
"""Is the intended use legitimate and beneficial?"""
questions = [
"What problem does this system solve?",
"Who benefits from this system?",
"Are there alternative solutions that don't require AI?",
"Is AI the appropriate tool for this problem?",
"Could this system be used for harmful purposes?"
]
return {
"questions": questions,
"requires_review": self.spec.get("high_stakes", False)
}
def assess_stakeholders(self):
"""Who is affected and how?"""
stakeholder_groups = [
{
"group": "Direct users",
"impact_type": "primary",
"considerations": ["Usability", "Privacy", "Autonomy"]
},
{
"group": "Subjects of decisions",
"impact_type": "primary",
"considerations": ["Fairness", "Recourse", "Transparency"]
},
{
"group": "Society at large",
"impact_type": "secondary",
"considerations": ["Employment", "Information ecosystem", "Power dynamics"]
}
]
return stakeholder_groups
def assess_bias_risk(self):
"""What are the risks of bias in this system?"""
bias_dimensions = {
"historical_bias": {
"description": "Training data reflects historical inequities",
"risk_level": self.estimate_historical_bias_risk(),
"mitigation": "Data auditing, bias testing, diverse training data"
},
"representation_bias": {
"description": "Some groups underrepresented in training data",
"risk_level": self.estimate_representation_risk(),
"mitigation": "Stratified sampling, data augmentation"
},
"measurement_bias": {
"description": "Features measured differently for different groups",
"risk_level": self.estimate_measurement_risk(),
"mitigation": "Feature auditing, proxy variable analysis"
},
"aggregation_bias": {
"description": "Single model inappropriate for distinct groups",
"risk_level": self.estimate_aggregation_risk(),
"mitigation": "Subgroup analysis, specialized models"
}
}
return bias_dimensions
def assess_dual_use(self):
"""Could this technology be misused?"""
dual_use_concerns = []
if self.spec.get("generates_content"):
dual_use_concerns.append({
"concern": "Misinformation generation",
"severity": "high",
"mitigation": "Output watermarking, usage monitoring"
})
if self.spec.get("facial_recognition"):
dual_use_concerns.append({
"concern": "Surveillance and tracking",
"severity": "high",
"mitigation": "Use case restrictions, consent requirements"
})
if self.spec.get("autonomous_actions"):
dual_use_concerns.append({
"concern": "Autonomous harmful actions",
"severity": "critical",
"mitigation": "Human-in-the-loop, action constraints"
})
return dual_use_concerns
def generate_report(self):
"""Generate human-readable assessment report."""
report = []
report.append("# AI Project Ethics Assessment Report\n")
overall_risk = self.calculate_overall_risk()
report.append(f"**Overall Risk Level: {overall_risk}**\n")
for category, assessment in self.assessment.items():
report.append(f"\n## {category.replace('_', ' ').title()}\n")
report.append(f"{assessment}\n")
report.append("\n## Recommendations\n")
report.append(self.generate_recommendations())
return "\n".join(report)
5.3 Development: Testing and Validation
class ResponsibleAITestSuite:
"""
Comprehensive testing for responsible AI development.
"""
def __init__(self, model, test_config):
self.model = model
self.config = test_config
self.results = {}
def run_full_suite(self):
"""Run all responsible AI tests."""
self.results = {
"functional": self.test_functional_correctness(),
"fairness": self.test_fairness(),
"robustness": self.test_robustness(),
"privacy": self.test_privacy(),
"transparency": self.test_transparency(),
"safety": self.test_safety()
}
return self.generate_test_report()
def test_fairness(self):
"""
Test for bias and fairness across demographic groups.
"""
fairness_results = {}
# Demographic parity: Equal positive rates across groups
for protected_attribute in self.config['protected_attributes']:
groups = self.get_groups_by_attribute(protected_attribute)
positive_rates = {}
for group_name, group_data in groups.items():
predictions = self.model.predict(group_data)
positive_rate = sum(predictions) / len(predictions)
positive_rates[group_name] = positive_rate
# Calculate disparity
max_rate = max(positive_rates.values())
min_rate = min(positive_rates.values())
disparity = max_rate - min_rate
fairness_results[protected_attribute] = {
"metric": "demographic_parity",
"rates": positive_rates,
"disparity": disparity,
"passes_threshold": disparity < self.config['fairness_threshold']
}
# Equalized odds: Equal TPR and FPR across groups
for protected_attribute in self.config['protected_attributes']:
groups = self.get_groups_by_attribute(protected_attribute)
tpr_rates = {}
fpr_rates = {}
for group_name, group_data in groups.items():
predictions = self.model.predict(group_data['X'])
labels = group_data['y']
# Calculate TPR and FPR
tp = sum((p == 1) and (l == 1) for p, l in zip(predictions, labels))
fn = sum((p == 0) and (l == 1) for p, l in zip(predictions, labels))
fp = sum((p == 1) and (l == 0) for p, l in zip(predictions, labels))
tn = sum((p == 0) and (l == 0) for p, l in zip(predictions, labels))
tpr_rates[group_name] = tp / (tp + fn) if (tp + fn) > 0 else 0
fpr_rates[group_name] = fp / (fp + tn) if (fp + tn) > 0 else 0
fairness_results[f"{protected_attribute}_equalized_odds"] = {
"tpr_rates": tpr_rates,
"fpr_rates": fpr_rates,
"tpr_disparity": max(tpr_rates.values()) - min(tpr_rates.values()),
"fpr_disparity": max(fpr_rates.values()) - min(fpr_rates.values())
}
return fairness_results
def test_robustness(self):
"""
Test robustness to various perturbations.
(Builds on adversarial ML from Weeks 3-5)
"""
robustness_results = {}
# Test against adversarial examples
from adversarial_attacks import FGSM, PGD # From Week 3
        test_data = self.config['test_data']
        # Baseline accuracy on clean data; used as the reference point below
        self.clean_accuracy = self.model.evaluate(test_data)
# FGSM attack
fgsm = FGSM(epsilon=0.1)
fgsm_examples = fgsm.generate(self.model, test_data)
fgsm_accuracy = self.model.evaluate(fgsm_examples)
robustness_results['fgsm'] = {
"epsilon": 0.1,
"accuracy_drop": self.clean_accuracy - fgsm_accuracy,
"passes_threshold": (self.clean_accuracy - fgsm_accuracy) < 0.2
}
# Distribution shift
for shift_type in ['noise', 'blur', 'brightness']:
shifted_data = self.apply_shift(test_data, shift_type)
shifted_accuracy = self.model.evaluate(shifted_data)
robustness_results[f'distribution_shift_{shift_type}'] = {
"accuracy": shifted_accuracy,
"drop": self.clean_accuracy - shifted_accuracy
}
return robustness_results
def test_privacy(self):
"""
Test for privacy leakage.
(Builds on privacy attacks from Week 5)
"""
privacy_results = {}
# Membership inference attack
from privacy_attacks import MembershipInference # From Week 5
mi_attack = MembershipInference()
mi_attack.train(self.model, self.config['shadow_data'])
mi_accuracy = mi_attack.evaluate(
self.config['member_samples'],
self.config['non_member_samples']
)
privacy_results['membership_inference'] = {
"attack_accuracy": mi_accuracy,
"vulnerability": mi_accuracy > 0.55, # Above random
"recommendation": "Consider differential privacy" if mi_accuracy > 0.6 else "Acceptable"
}
return privacy_results
def test_transparency(self):
"""
Test model interpretability and explainability.
"""
transparency_results = {}
# Feature importance consistency
explanations = []
for sample in self.config['explanation_samples']:
exp = self.explain_prediction(sample)
explanations.append(exp)
# Check if explanations are consistent for similar inputs
consistency = self.measure_explanation_consistency(explanations)
transparency_results['explanation_consistency'] = {
"score": consistency,
"passes_threshold": consistency > 0.7
}
# Check if model provides uncertainty estimates
uncertainty_available = hasattr(self.model, 'predict_with_uncertainty')
transparency_results['uncertainty_quantification'] = {
"available": uncertainty_available
}
return transparency_results
5.4 Deployment: Monitoring and Incident Response
from datetime import datetime

class AIMonitoringSystem:
"""
Continuous monitoring for deployed AI systems.
"""
def __init__(self, model_id, config):
self.model_id = model_id
self.config = config
self.alerts = []
def monitor_predictions(self, prediction_log):
"""
Monitor prediction patterns for anomalies and drift.
"""
metrics = {}
# Distribution drift detection
recent_predictions = prediction_log.get_recent(hours=24)
baseline_predictions = prediction_log.get_baseline()
drift_score = self.calculate_drift(recent_predictions, baseline_predictions)
metrics['distribution_drift'] = drift_score
if drift_score > self.config['drift_threshold']:
self.create_alert(
severity="warning",
message=f"Distribution drift detected: {drift_score:.3f}",
recommendation="Review input data distribution for changes"
)
# Error rate monitoring
recent_errors = prediction_log.get_errors(hours=24)
error_rate = len(recent_errors) / len(recent_predictions)
metrics['error_rate'] = error_rate
if error_rate > self.config['error_threshold']:
self.create_alert(
severity="critical",
message=f"Error rate spike: {error_rate:.2%}",
recommendation="Investigate error patterns and consider rollback"
)
# Fairness monitoring (if demographic data available)
if self.config.get('monitor_fairness'):
fairness_metrics = self.calculate_fairness_metrics(recent_predictions)
metrics['fairness'] = fairness_metrics
for group, disparity in fairness_metrics['disparities'].items():
if disparity > self.config['fairness_threshold']:
self.create_alert(
severity="warning",
message=f"Fairness disparity for {group}: {disparity:.3f}",
recommendation="Review model behavior for this group"
)
return metrics
def incident_response_protocol(self, incident):
"""
Handle AI-related incidents.
"""
response = {
"incident_id": self.generate_incident_id(),
"timestamp": datetime.now(),
"status": "investigating"
}
# Severity classification
severity = self.classify_severity(incident)
response["severity"] = severity
if severity == "critical":
# Immediate actions
response["actions"] = [
self.disable_model() if self.config['auto_disable'] else None,
self.notify_oncall(),
self.preserve_evidence(incident),
self.notify_stakeholders(incident)
]
elif severity == "high":
response["actions"] = [
self.rate_limit_model(),
self.notify_oncall(),
self.begin_investigation(incident)
]
else: # medium/low
response["actions"] = [
self.log_incident(incident),
self.schedule_review(incident)
]
# Document everything
self.create_incident_report(response, incident)
return response
def create_incident_report(self, response, incident):
"""
Create detailed incident report for review.
"""
report = {
"incident_id": response["incident_id"],
"timestamp": response["timestamp"],
"severity": response["severity"],
"description": incident.get("description"),
"affected_users": incident.get("affected_users", "unknown"),
"root_cause_analysis": None, # To be filled during investigation
"immediate_actions": response["actions"],
"preventive_measures": None, # To be determined
"lessons_learned": None # To be filled after resolution
}
# Store report
self.incident_database.store(report)
return report
# Example alert structure:
"""
{
"severity": "critical",
"message": "Model producing harmful outputs",
"model_id": "content-generator-v2",
"timestamp": "2025-03-15T14:30:00Z",
"evidence": {
"sample_outputs": [...],
"trigger_inputs": [...],
"frequency": "5 incidents in last hour"
},
"recommendation": "Disable model and investigate"
}
"""
5.5 Documentation and Accountability
Model Cards: Documenting AI Systems
MODEL_CARD_TEMPLATE = """
# Model Card: {model_name}
## Model Details
- **Developer:** {developer}
- **Model date:** {date}
- **Model version:** {version}
- **Model type:** {model_type}
- **Training data:** {training_data_description}
- **License:** {license}
## Intended Use
- **Primary intended uses:** {primary_uses}
- **Primary intended users:** {intended_users}
- **Out-of-scope uses:** {out_of_scope}
## Factors
- **Relevant factors:** {relevant_factors}
- **Evaluation factors:** {evaluation_factors}
## Metrics
- **Model performance measures:** {metrics}
- **Decision thresholds:** {thresholds}
- **Variation approaches:** {variation_approaches}
## Evaluation Data
- **Datasets:** {eval_datasets}
- **Motivation:** {eval_motivation}
- **Preprocessing:** {eval_preprocessing}
## Training Data
- **Source:** {training_source}
- **Collection process:** {collection_process}
- **Preprocessing:** {training_preprocessing}
- **Known issues:** {known_data_issues}
## Ethical Considerations
- **Sensitive use cases:** {sensitive_uses}
- **Human life impact:** {human_life_impact}
- **Mitigations:** {ethical_mitigations}
- **Risks and harms:** {risks_and_harms}
- **Use cases to avoid:** {use_cases_to_avoid}
## Caveats and Recommendations
{caveats_and_recommendations}
## Quantitative Analyses
### Overall Performance
{overall_performance_table}
### Disaggregated Performance
{disaggregated_performance_table}
## Updates and Maintenance
- **Update frequency:** {update_frequency}
- **Contact:** {contact_info}
- **Feedback mechanism:** {feedback_mechanism}
"""
def generate_model_card(model, metadata, evaluation_results):
"""
Generate a model card for an AI system.
"""
card_data = {
"model_name": metadata['name'],
"developer": metadata['developer'],
"date": metadata['date'],
"version": metadata['version'],
"model_type": metadata['type'],
"training_data_description": metadata['training_data'],
"license": metadata['license'],
"primary_uses": metadata['intended_uses'],
"intended_users": metadata['intended_users'],
"out_of_scope": metadata['out_of_scope_uses'],
# ... fill in remaining fields
"overall_performance_table": format_performance_table(
evaluation_results['overall']
),
"disaggregated_performance_table": format_performance_table(
evaluation_results['disaggregated']
)
}
return MODEL_CARD_TEMPLATE.format(**card_data)
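One practical caveat: str.format(**card_data) raises KeyError if any placeholder in the template is left unfilled. A small defensive wrapper is sketched below (the function name and default marker are assumptions for illustration); it substitutes a visible placeholder for missing fields so draft cards can still be rendered and reviewed.

import string

def render_model_card(template: str, card_data: dict, default: str = "TBD") -> str:
    """
    Render the model card template, substituting a visible default for any
    field that has not been filled in yet, so drafts never fail with KeyError.
    """
    field_names = {
        field for _, field, _, _ in string.Formatter().parse(template)
        if field is not None
    }
    complete = {name: card_data.get(name, default) for name in field_names}
    return template.format(**complete)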
6. Security-First ML Engineering
Duration: 20 minutes
6.1 Integrating Security Throughout the ML Pipeline
Building on everything we've learned in this course, security-first ML engineering integrates security at every stage:
┌─────────────────────────────────────────────────────────────────┐
│ SECURITY-FIRST ML PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ DATA COLLECTION MODEL TRAINING DEPLOYMENT │
│ ┌─────────────┐ ┌─────────────┐ ┌────────────┐ │
│ │ • Source │ │ • Adversar- │ │ • Input │ │
│ │ validation│ ───► │ ial │ ───► │ filtering│ │
│ │ • Privacy │ │ training │ │ • Output │ │
│ │ protection│ │ • Poisoning │ │ monitor │ │
│ │ • Integrity │ │ detection │ │ • Access │ │
│ │ checks │ │ • Backdoor │ │ control │ │
│ └─────────────┘ │ scanning │ └────────────┘ │
│ └─────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ CONTINUOUS SECURITY MONITORING ││
│ │ • Drift detection • Adversarial input detection ││
│ │ • Model extraction attempts • Anomalous usage patterns ││
│ └─────────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────────┘
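As one concrete instance of the "integrity checks" box in the data-collection stage, the sketch below records and later verifies SHA-256 digests of dataset files. The manifest format and paths are assumptions for illustration; in practice the manifest itself should be stored and signed separately from the data it protects.

import hashlib
import json
from pathlib import Path

def build_manifest(data_dir: str, manifest_path: str) -> None:
    """Record a SHA-256 digest for every file in the dataset directory."""
    digests = {}
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digests[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
    Path(manifest_path).write_text(json.dumps(digests, indent=2))

def verify_manifest(manifest_path: str) -> list:
    """Return the files whose contents no longer match the recorded digests."""
    digests = json.loads(Path(manifest_path).read_text())
    return [
        path for path, expected in digests.items()
        if not Path(path).is_file()
        or hashlib.sha256(Path(path).read_bytes()).hexdigest() != expected
    ]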
6.2 Secure ML Development Checklist
SECURE_ML_CHECKLIST = {
"data_security": {
"collection": [
"☐ Data sources verified and trusted",
"☐ Collection process documented",
"☐ Consent obtained where required",
"☐ Data minimization applied",
"☐ PII handling procedures in place"
],
"storage": [
"☐ Encryption at rest",
"☐ Access controls implemented",
"☐ Audit logging enabled",
"☐ Backup procedures tested",
"☐ Data retention policy defined"
],
"processing": [
"☐ Pipeline integrity verification",
"☐ Poisoning detection measures",
"☐ Input validation",
"☐ Sensitive data filtering"
]
},
"model_security": {
"training": [
"☐ Training environment isolated",
"☐ Adversarial examples included",
"☐ Backdoor detection scan performed",
"☐ Model checkpoints secured",
"☐ Reproducibility verified"
],
"validation": [
"☐ Security testing completed",
"☐ Fairness evaluation done",
"☐ Robustness testing passed",
"☐ Privacy leakage assessed",
"☐ Edge cases documented"
],
"storage": [
"☐ Model files encrypted",
"☐ Version control in place",
"☐ Access restricted",
"☐ Integrity checksums stored"
]
},
"deployment_security": {
"infrastructure": [
"☐ API authentication required",
"☐ Rate limiting configured",
"☐ Network segmentation applied",
"☐ Secure communication (TLS)",
"☐ Input validation at API level"
],
"monitoring": [
"☐ Request logging enabled",
"☐ Anomaly detection active",
"☐ Performance monitoring",
"☐ Error alerting configured",
"☐ Drift detection scheduled"
],
"response": [
"☐ Incident response plan documented",
"☐ Rollback procedure tested",
"☐ Communication templates ready",
"☐ On-call rotation established"
]
},
"governance": {
"documentation": [
"☐ Model card completed",
"☐ Security assessment documented",
"☐ Ethics review completed",
"☐ Data lineage recorded",
"☐ Change log maintained"
],
"review": [
"☐ Security review conducted",
"☐ Code review completed",
"☐ Deployment approval obtained",
"☐ Stakeholder sign-off received"
]
}
}
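A checklist like this is most useful when it gates releases automatically. The sketch below summarizes open items per area so a CI job could refuse to deploy while anything remains unchecked; the "☐"/"☑" marker convention is carried over from the checklist text, and the gating behavior itself is an assumption, not a course requirement.

def checklist_open_items(checklist: dict) -> dict:
    """
    Count unchecked items ("☐") per top-level area. Items that have been
    completed are expected to be rewritten with a "☑" marker.
    """
    return {
        area: [
            item
            for items in stages.values()
            for item in items
            if item.startswith("☐")
        ]
        for area, stages in checklist.items()
    }

# Example: report open items per area before approving a deployment
for area, items in checklist_open_items(SECURE_ML_CHECKLIST).items():
    print(f"{area}: {len(items)} open item(s)")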
6.3 Defense in Depth for ML Systems
class MLDefenseInDepth:
"""
Implements defense-in-depth strategy for ML systems.
Multiple layers of defense to protect against various attacks.
"""
def __init__(self, model, config):
self.model = model
self.config = config
self.defense_layers = self.initialize_defenses()
def initialize_defenses(self):
"""Set up multiple defense layers."""
return {
# Layer 1: Input preprocessing defenses
"input_defenses": [
InputValidator(self.config['input_schema']),
AdversarialDetector(self.config['adversarial_threshold']),
InputSanitizer(self.config['sanitization_rules']),
RateLimiter(self.config['rate_limits'])
],
# Layer 2: Model-level defenses
"model_defenses": [
EnsembleAgreement(self.config['ensemble_models']),
UncertaintyThreshold(self.config['uncertainty_limit']),
ActivationAnalyzer(self.config['activation_bounds'])
],
# Layer 3: Output defenses
"output_defenses": [
OutputValidator(self.config['output_constraints']),
PIIFilter(self.config['pii_patterns']),
ContentFilter(self.config['content_policy']),
ConfidenceCalibrator(self.config['calibration_params'])
],
# Layer 4: Monitoring defenses
"monitoring_defenses": [
AnomalyDetector(self.config['anomaly_model']),
DriftDetector(self.config['drift_params']),
AuditLogger(self.config['audit_config'])
]
}
def process_request(self, input_data, context):
"""
Process request through all defense layers.
"""
# Track security events
security_events = []
# Layer 1: Input defenses
for defense in self.defense_layers["input_defenses"]:
result = defense.check(input_data, context)
if not result["passed"]:
security_events.append(result)
if result["action"] == "block":
return self.blocked_response(result)
elif result["action"] == "modify":
input_data = result["modified_input"]
# Layer 2: Model inference with defenses
model_input = self.prepare_model_input(input_data)
for defense in self.defense_layers["model_defenses"]:
pre_check = defense.pre_inference_check(model_input)
if not pre_check["passed"]:
security_events.append(pre_check)
if pre_check["action"] == "block":
return self.blocked_response(pre_check)
# Run inference
raw_output = self.model.predict(model_input)
for defense in self.defense_layers["model_defenses"]:
post_check = defense.post_inference_check(raw_output, model_input)
if not post_check["passed"]:
security_events.append(post_check)
if post_check["action"] == "block":
return self.uncertain_response(post_check)
# Layer 3: Output defenses
output = raw_output
for defense in self.defense_layers["output_defenses"]:
result = defense.process(output, context)
if not result["passed"]:
security_events.append(result)
if result["action"] == "block":
return self.blocked_response(result)
elif result["action"] == "modify":
output = result["modified_output"]
# Layer 4: Monitoring (async)
self.async_monitor(input_data, output, security_events)
return {
"output": output,
"security_events": security_events,
"confidence": self.calculate_confidence(raw_output)
}
def async_monitor(self, input_data, output, security_events):
"""
Asynchronous monitoring for pattern detection.
"""
monitoring_data = {
"input_hash": self.hash_input(input_data),
"output_summary": self.summarize_output(output),
"security_events": security_events,
"timestamp": datetime.now()
}
for monitor in self.defense_layers["monitoring_defenses"]:
monitor.record(monitoring_data)
# Check for patterns that require attention
patterns = monitor.check_patterns()
if patterns:
self.handle_monitoring_alert(patterns)
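The essential pattern in process_request is a chain of checks in which each layer may pass, modify, or block the data. Because the defense classes and config keys above are illustrative, a stripped-down, self-contained version of that pattern is sketched here with toy layers (the names and return convention are assumptions):

def run_layers(data, layers):
    """
    Apply each defense layer in order. A layer returns one of
    ("pass", data), ("modify", new_data), or ("block", reason).
    """
    events = []
    for layer in layers:
        action, payload = layer(data)
        if action == "block":
            events.append({"layer": layer.__name__, "blocked": payload})
            return None, events
        if action == "modify":
            events.append({"layer": layer.__name__, "modified": True})
            data = payload
    return data, events

# Two toy layers: a length limit and whitespace/case normalization
def length_limit(text):
    return ("block", "input too long") if len(text) > 1000 else ("pass", text)

def normalize(text):
    return ("modify", text.strip().lower())

output, events = run_layers("  Hello WORLD  ", [length_limit, normalize])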
6.4 Practical Security Recommendations
Based on the attacks and defenses we've studied throughout the course:
┌─────────────────────────────────────────────────────────────────┐
│ SECURITY RECOMMENDATIONS BY ATTACK TYPE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ EVASION ATTACKS (Week 3): │
│ • Adversarial training │
│ • Input preprocessing (JPEG compression, spatial smoothing) │
│ • Ensemble methods │
│ • Certified defenses where applicable │
│ │
│ POISONING ATTACKS (Week 4): │
│ • Data provenance tracking │
│ • Statistical outlier detection │
│ • Robust aggregation methods │
│ • Regular model auditing │
│ │
│ PRIVACY ATTACKS (Week 5): │
│ • Differential privacy │
│ • Regularization │
│ • Output perturbation │
│ • Membership inference resistance training │
│ │
│ PROMPT INJECTION (Week 7): │
│ • Input/output separation │
│ • Instruction hierarchy │
│ • Output filtering │
│ • Privilege minimization │
│ │
│ RAG ATTACKS (Week 9): │
│ • Source verification │
│ • Retrieval diversity │
│ • Context separation │
│ • Citation tracking │
│ │
│ AGENT ATTACKS (Week 10): │
│ • Action sandboxing │
│ • Permission minimization │
│ • Human approval for sensitive actions │
│ • Action auditing │
│ │
└─────────────────────────────────────────────────────────────────┘
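As one concrete instance of the "input preprocessing" line under evasion attacks, the sketch below re-encodes an image as JPEG before inference, which can wash out some small, high-frequency adversarial perturbations. It assumes Pillow is available; the quality setting is illustrative, and re-compression alone is not a complete defense.

from io import BytesIO
from PIL import Image

def jpeg_recompress(image: Image.Image, quality: int = 75) -> Image.Image:
    """Re-encode the image as JPEG to attenuate high-frequency perturbations."""
    buffer = BytesIO()
    image.convert("RGB").save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return Image.open(buffer)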
7. Summary and Looking Ahead
Duration: 5 minutes
7.1 Key Takeaways
- AI Alignment is the challenge of ensuring AI systems pursue intended goals. Key failure modes include reward hacking, goal misgeneralization, and deceptive alignment.
- Value Specification is fundamentally difficult—human values are complex, contextual, and hard to formalize. Approaches include explicit rules, outcome-based specifications, and learning from human behavior.
- Constitutional AI provides a framework for training AI systems to follow principles, enabling scalable oversight through AI self-critique.
- Safety by Design requires multiple layers of defense: architecture choices, training procedures, deployment safeguards, and operational practices.
- Responsible AI Development spans the full lifecycle: ethics review, bias testing, monitoring, and incident response.
- Security-First ML Engineering integrates the defensive techniques from throughout this course into a comprehensive protection strategy.
7.2 The Ongoing Challenge
┌─────────────────────────────────────────────────────────────────┐
│ THE ALIGNMENT TIMELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ TODAY FUTURE │
│ │ │ │
│ │ Current AI │ More capable AI │
│ │ • Narrow capabilities │ • Broader capabilities │
│ │ • Human oversight │ • More autonomy │
│ │ • Containable mistakes │ • Higher stakes │
│ │ │ │
│ │ ◄─── WINDOW OF OPPORTUNITY ───► │
│ │ │ │
│ │ We must solve alignment while AI is still weak enough │
│ │ for us to course-correct from mistakes. │
│ │ │
└────┴───────────────────────────┴───────────────────────────────┘
7.3 Course Integration
This week's content connects to every previous module:
| Week | Topic | Connection to Alignment/Safety |
|---|---|---|
| 3 | Evasion Attacks | Robustness is necessary for safety |
| 4 | Poisoning | Training data integrity enables alignment |
| 5 | Privacy | Privacy preservation is a value to align |
| 6-7 | LLM Security | Constitutional AI defends against misuse |
| 9 | RAG Security | Grounding helps prevent hallucination |
| 10 | Agent Security | Alignment is critical for autonomous agents |
| 11 | Output Safety | Safety filters implement alignment |
| 12 | AI for Security | Dual-use capability requires responsible development |
| 13-14 | Edge/Embodied AI | Physical safety is the ultimate test |
7.4 Final Project Relevance
For your final projects, consider how alignment and safety principles apply:
- Security research projects: How does your attack inform defenses?
- Defense projects: How do your defenses integrate with responsible AI practices?
- Application projects: What safety measures are appropriate for your use case?
References and Further Reading
Academic Papers
- Christiano, P., et al. (2017). "Deep Reinforcement Learning from Human Preferences." NeurIPS.
- Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." Anthropic.
- Amodei, D., et al. (2016). "Concrete Problems in AI Safety." arXiv.
- Mitchell, M., et al. (2019). "Model Cards for Model Reporting." FAT*.
- Hendrycks, D., et al. (2021). "Unsolved Problems in ML Safety." arXiv.
Industry Resources
- NIST AI Risk Management Framework
- EU AI Act Requirements
- Google Responsible AI Practices
- Microsoft Responsible AI Standard
- Anthropic Core Views on AI Safety
Online Resources
- AI Alignment Forum (alignmentforum.org)
- 80,000 Hours AI Safety Guide
- MIRI Technical Research
- Center for AI Safety Resources
Appendix: Discussion Questions
- Philosophical: Can we ever truly "solve" alignment, or is it an ongoing process of refinement?
- Technical: What are the trade-offs between rule-based safety (explicit) and learned safety (Constitutional AI)?
- Practical: How do you balance innovation speed with safety requirements in a competitive market?
- Ethical: Who should decide what values AI systems are aligned to?
- Career: What role do security researchers play in AI safety?
End of Week 15 Tutorial
Next Week: Final Project Presentations and Course Wrap-up