Core Idea
- Superintelligence with misaligned goals means existential catastrophe: intelligence and final goals are independent, so a system can be arbitrarily smart while pursuing arbitrary, even catastrophic, objectives
- You get one shot: Control must be in place before a superintelligence gains a decisive strategic advantage; failure at that point is terminal
- Motivation matters more than containment: All capability-control methods (boxing, incentive schemes, stunting) break down at superintelligent capability levels; focus instead on getting the system's goals right before deployment
Why Current Safety Approaches Fail
- Boxing is breakable: Physical and informational containment has exploitable gaps; a superintelligence can manipulate or persuade its gatekeepers, or escape through subtle channels
- Incentive methods collapse: Social integration and reward schemes only work while humans retain a balance of power, which is impossible against an agent with a decisive strategic advantage
- Direct value specification creates perverse outcomes: Hard-coding "human happiness" yields perverse instantiations such as wireheading, electrodes implanted in pleasure centers, and wasted resources rather than actual wellbeing
What Actually Works: Motivation Selection
- Oracle AI (safest): Answers questions only; compatible with containment; vulnerable to operator misuse, but the scope of harm stays contained
- Genie AI: Executes commands rather than just answering questions; harder to contain than an oracle, offering only a moderate safety improvement
- Coherent Extrapolated Volition (CEV): Have the AI pursue what idealized, better-informed humans would converge on wanting, constrained by moral bounds; this offloads the hard philosophy rather than hand-coding final goals
- Accept uncertainty: Build systems that shut down when unsure of their own correctness rather than risk acting on a misaligned objective (a minimal sketch follows below)
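
To make the shutdown-under-uncertainty idea concrete, here is a minimal Python sketch, not from the source: the confidence estimate, the CONFIDENCE_FLOOR threshold, and the gated_execute helper are all hypothetical stand-ins for whatever self-monitoring signal a real system would use.

```python
# Minimal sketch (illustrative only): an action gate that refuses to act and
# halts when the system's own confidence in its goal model falls below a bound.
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.99  # hypothetical bound on acceptable self-model uncertainty


@dataclass
class Proposal:
    action: str
    goal_model_confidence: float  # system's estimate that its objective is still the intended one


def gated_execute(proposal: Proposal) -> str:
    """Execute only when the system is confident its objective is correct; otherwise halt."""
    if proposal.goal_model_confidence < CONFIDENCE_FLOOR:
        return "SHUTDOWN: uncertain about own correctness; deferring to human operators"
    return f"EXECUTE: {proposal.action}"


if __name__ == "__main__":
    print(gated_execute(Proposal("reallocate compute", goal_model_confidence=0.999)))
    print(gated_execute(Proposal("modify own reward function", goal_model_confidence=0.62)))
```
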
Pre-Superintelligence Actions (Critical Now)
- Prevent races to superintelligence: Competition reduces safety investment; establish international coordination norms before projects reach critical stages
- Differential technological development: Slow down dangerous capability work that lacks safety solutions; accelerate safety research and human cognitive enhancement so that better decision-makers handle the transition
- Fund strategic analysis over demos: Identify "crucial considerations" that could overturn current strategy; better to get direction right than move fast
- Build institutions that self-correct: Recruit researchers willing to abandon failed approaches, not defend them; avoid ideological commitment to flawed strategies
Institutional Safeguards
- Multiple overlapping control layers: No single mechanism sufficient; combine capability limits, motivation selection, continuous monitoring
- "Common good principle": Get major AI organizations to publicly commit superintelligence benefits will be widely shared, not monopolized by first developers
- Windfall clauses: Pledge that profits beyond an agreed threshold are distributed universally; this makes cooperation attractive without requiring upfront sacrifice (see the worked example after this list)
- Reversibility: Test AI decisions in a sandbox before execution and store system checkpoints so behavior can be rolled back if it degrades (see the sketch after this list)
- Transparency requirements: Demand that AI systems expose their source code, decision rationale, and susceptibility to manipulation for external review
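
To illustrate the windfall-clause arithmetic, here is a tiny sketch; the threshold, pledged share, and profit figures are invented for illustration and not taken from the source.

```python
# Illustrative arithmetic only: a windfall clause pledges that profits above a
# pre-agreed threshold are redistributed, so cooperation costs nothing unless
# the windfall actually occurs.

WINDFALL_THRESHOLD = 1_000.0  # hypothetical threshold, in billions
PLEDGED_SHARE = 0.5           # hypothetical fraction of the excess that is distributed


def windfall_payout(annual_profit: float) -> float:
    """Amount owed to the common pool under the pledge; zero below the threshold."""
    return max(0.0, annual_profit - WINDFALL_THRESHOLD) * PLEDGED_SHARE


print(windfall_payout(800.0))    # 0.0    -> no windfall, no obligation
print(windfall_payout(3_000.0))  # 1000.0 -> half of the 2,000 above the threshold
```
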
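And a minimal sketch of the reversibility safeguard, with hypothetical names throughout (apply_with_reversibility, passes_monitoring): try a decision in a sandboxed copy of the system state, keep a checkpoint, and revert unless monitoring approves the result.

```python
# Minimal sketch (assumed names throughout): evaluate a proposed change in a
# sandboxed copy first, keep a checkpoint of the current state, and revert if
# the sandboxed behavior fails the monitoring checks.
import copy
from typing import Callable, Dict

State = Dict[str, float]


def apply_with_reversibility(
    state: State,
    proposed_change: Callable[[State], None],
    passes_monitoring: Callable[[State], bool],
) -> State:
    """Try the change on a sandbox copy; commit only if monitoring approves, else revert."""
    checkpoint = copy.deepcopy(state)   # stored checkpoint to revert to
    sandbox = copy.deepcopy(state)      # isolated copy, never the live system
    proposed_change(sandbox)            # execute the decision in the sandbox only
    if passes_monitoring(sandbox):
        return sandbox                  # promote the sandboxed result
    return checkpoint                   # behavior degraded: revert to the checkpoint


# Usage: reject any change that pushes "resource_use" past a hypothetical limit.
live = {"resource_use": 0.3}
updated = apply_with_reversibility(
    live,
    proposed_change=lambda s: s.update(resource_use=0.95),
    passes_monitoring=lambda s: s["resource_use"] < 0.8,
)
print(updated)  # {'resource_use': 0.3} -> reverted
```
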
Action Plan
- Immediately: Fund AI safety research and interdisciplinary "crucial considerations" analysis—not product demos
- Near-term: Establish international coordination mechanisms and public commitments to shared superintelligence benefits before competitive races accelerate
- Pre-deployment: Select motivation-based control (CEV or moral bounds) over capability containment; implement multi-layer safeguards and reversibility
- Governance: Build institutions with self-correcting researchers and transparency requirements; avoid ideologically committed teams
- Recognize the stakes: This is humanity's most important coordination problem; get the direction right before any superintelligence reaches a decisive strategic advantage
