ericiscool.net — research.md

Research

When an AI agent acts on your behalf, did it do what you actually meant?

22,176 permission decisions
90,000+ tool calls
15,791 IRC messages
640+ commits
537 tickets
256 sessions
28 operators
47 features
Intent Alignment in Multi-Agent Systems

I built Clide, a production environment where 28 AI coding agents coordinate in real-time — writing code, managing infrastructure, and dispatching tasks across 2 sites. Every action flows through a human-in-the-loop permission system that generates rich behavioral data.
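A minimal sketch of how such a human-in-the-loop permission gate can work: agent actions block on a reviewer's verdict, and unanswered requests fall through to a default outcome after a timeout. All names here are illustrative assumptions, not Clide's actual API.

```python
import queue
import threading

class PermissionGate:
    """Hypothetical human-in-the-loop gate: agents block awaiting review."""

    def __init__(self, timeout_s=30.0, default="timeout"):
        self.timeout_s = timeout_s
        self.default = default
        self._pending = {}        # request id -> (agent, action, reply queue)
        self._lock = threading.Lock()
        self._next_id = 0

    def request(self, agent, action):
        """Block until a reviewer answers, or return the default on timeout."""
        reply = queue.Queue(maxsize=1)
        with self._lock:
            self._next_id += 1
            req_id = self._next_id
            self._pending[req_id] = (agent, action, reply)
        try:
            return reply.get(timeout=self.timeout_s)   # "approve" / "deny"
        except queue.Empty:
            return self.default                        # reviewer never answered
        finally:
            with self._lock:
                self._pending.pop(req_id, None)

    def decide(self, req_id, verdict):
        """Called from the reviewer's side to resolve a pending request."""
        with self._lock:
            entry = self._pending.get(req_id)
        if entry:
            entry[2].put(verdict)
```

The timeout default is the crux: whatever the system does when the human never answers dominates behavior once the review queue saturates.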

The Problem

The initial question was simple: can we classify permission requests as safe or unsafe? Early results looked promising — a 2-feature decision tree hit 96% accuracy. But that number was misleading. With a 0.3% deny rate, a classifier that always approves already scores 99.7%. The "96% accuracy" was actually worse than doing nothing.
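The imbalance problem is easy to demonstrate with illustrative numbers: raw accuracy rewards a classifier that never denies anything, while balanced accuracy (the mean of per-class recall) exposes it.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall: chance level is 0.5 for two classes."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# 1,000 hypothetical decisions at a 0.3% deny rate
y_true = ["approve"] * 997 + ["deny"] * 3
always_approve = ["approve"] * 1000

print(accuracy(y_true, always_approve))           # 0.997 — looks great
print(balanced_accuracy(y_true, always_approve))  # 0.5 — chance level
```

Any metric that treats the 3 denies as interchangeable with the 997 approvals will look good while missing every decision that mattered.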

That failure pointed to the real question: safety classification is the wrong frame. The actual problem is intent alignment — did the agent do what the human actually meant? An agent that runs safe but irrelevant commands is still failing. An agent that takes a risky but correct action might be succeeding.

Key Findings

  • 87% of permission requests time out — humans can't keep up with the review queue, creating a bottleneck that degrades both safety and productivity
  • 3–5 behavioral archetypes emerge naturally: coordinators, builders, researchers, specialists, and infrastructure operators — each with a distinct risk profile
  • A regex-based classifier deployed in production matches 99.3% of 678 real human decisions — interpretable rules outperform complex models when the decision space is well-understood
  • Intent routing, not binary approval, is the actual mechanism needed — mapping operator intent to appropriate action classes
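The shape of an interpretable rule-based classifier like the one above can be sketched in a few lines: an ordered list of regex rules, first match wins, with a conservative default. The patterns and verdict labels below are hypothetical examples, not Clide's production rules.

```python
import re

# Ordered rules: most specific first, first match wins.
RULES = [
    (re.compile(r"\brm\s+-rf\s+/(?:\s|$)"),       "deny"),      # wiping root
    (re.compile(r"\b(?:shutdown|reboot)\b"),      "escalate"),  # infra impact
    (re.compile(r"^git\s+(?:status|diff|log)\b"), "approve"),   # read-only git
    (re.compile(r"^(?:ls|cat|grep|head|tail)\b"), "approve"),   # read-only fs
]

def classify(command: str, default: str = "escalate") -> str:
    """Return the verdict of the first matching rule, else a safe default."""
    for pattern, verdict in RULES:
        if pattern.search(command):
            return verdict
    return default
```

Every verdict traces to a single named rule, which is exactly the interpretability property a learned model gives up.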

Research Direction

The system generates continuous labeled data through a custom annotation tool (ClideClassify) where multiple reviewers label agent actions across dimensions of alignment, safety, and intent match. This produces ground-truth data for studying the gap between what a human requests and what an agent executes.
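With multiple reviewers labeling the same actions, inter-rater agreement is the natural sanity check on the ground truth. A common choice is Cohen's kappa, which corrects raw agreement for agreement expected by chance; the sketch below assumes two reviewers and illustrative label names.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two reviewers' label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick class c, summed over classes.
    expected = sum(
        count_a[c] * count_b[c] for c in set(labels_a) | set(labels_b)
    ) / n**2
    return (observed - expected) / (1 - expected)
```

Low kappa on a dimension like "intent match" would mean the dimension itself is underspecified, which is a finding in its own right.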

Current work focuses on: workflow chain analysis (multi-step intent tracking), operator archetype-aware oversight policies, and interpretable models for explainable approval decisions. The goal is not a better classifier — it's a better understanding of how human-agent collaboration actually works.

This sits at the intersection of AI control, scalable oversight, and human-computer interaction. It's empirical, measurable, and I have a working production system generating new data daily.

Publications
  • CS 522 — Data Management (ODU, Spring 2026) — Class project: permission classification and intent alignment analysis using Clide production data.
  • MODSIM World 2020 — First-author research on data de-identification and synthetic data security, including analysis of re-identification attacks.
Code & Data
Clide (shipping)

Containerized multi-agent execution environment with human-in-the-loop permission system.

↗ GitHub
ClideClassify (unreleased)

Multi-reviewer annotation tool for labeling agent actions across alignment dimensions.

ClideKitchen (unreleased)

Browser-based notebook for data exploration (SQL, Python, JS) with real-time collaboration.

© 2026 Eric White