Inbox Triage Classifier: 87% Top-1 Agreement With Human Triager on n=1,200 Replies
We evaluated an LLM-based triage classifier that partitions inbound replies into three classes — warm, needs-human, ignore — and emits a calibrated confidence score on each call. On a labelled set of n=1,200 replies it achieves 87% agreement with our human triager. Error analysis shows residual disagreement is dominated by hedged-affirmative language, a tractable failure mode.
- Status: internal use only
- Clearance: α-02
- Surface: INBOX · ops
- Read: 3 min
- GPT-4o-mini: bucket + confidence calls
- Zod: structured-output schema
- Gmail API: inbound reply ingestion
- Supabase: labelled-reply store + audit log
- Slack: needs-human routing
H1: A small classifier can reliably separate intent-positive replies from soft-decline and opt-out replies with a low false-negative rate. If H1 holds, the operator needs to inspect only about one third of the inbox, at no cost to recall on warm intent.
- Corpus: n=1,200 inbound replies sampled from inboxes operated by enso AI agents
- Label schema: three mutually exclusive classes — WARM, NEEDS_HUMAN, IGNORE
- Per-call confidence score used as the operative routing signal
- Validation: held-out set of n=200 replies excluded from tuning
The end-to-end recipe. Follow it top to bottom; each step assumes the previous one ran cleanly.
Specify the label schema in writing
Prior to any model work we authored a one-paragraph operational definition for each class, accompanied by two exemplars and two counter-exemplars. Label ambiguity is the primary upstream determinant of classifier quality; under-specified classes propagate directly into routing error.
Enforce a structured response schema
Outputs are constrained to a Zod-validated schema: class, confidence, one-sentence rationale. The rationale is primarily a diagnostic for reviewers, but empirically it also improves classification quality — consistent with chain-of-thought elicitation effects.
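A minimal sketch of what such a schema could look like is given here. The field names, the `TriageResult` identifier, and the `parseTriageResult` helper are assumptions made for illustration; the source only states that the output carries a class, a confidence, and a one-sentence rationale, and it does not say how the confidence is bounded (a 0 to 1 range is assumed).

```typescript
import { z } from "zod";

// Assumed shape: the three classes come from the label schema; the
// field names and the 0..1 confidence range are illustrative guesses.
export const TriageResult = z.object({
  class: z.enum(["WARM", "NEEDS_HUMAN", "IGNORE"]),
  confidence: z.number().min(0).max(1),
  rationale: z.string().min(1),
});

export type TriageResult = z.infer<typeof TriageResult>;

// Hypothetical helper: parse the raw model output (a JSON string) and
// return null instead of throwing when it is malformed or off-schema.
export function parseTriageResult(raw: string): TriageResult | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  const result = TriageResult.safeParse(data);
  return result.success ? result.data : null;
}
```

Returning `null` rather than throwing lets the caller treat an unparseable response like a low-confidence one and hand it to a human, which is one reasonable design choice among several.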
Route on confidence, not on predicted class
Any prediction with confidence <0.65 is escalated to a human reviewer irrespective of predicted class. The predicted class is a prior; the confidence score is the operative decision variable.
Fig. How a reply moves through the agent
- 01 Reply arrives: inbound thread
- 02 Agent reads it: bucket + confidence
- 03 Confidence < 65%? → human
- 04 Route or queue: warm / ignore
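The routing rule can be sketched as a single function. The 0.65 threshold and the three classes come from the text; the `Route` names, the `Prediction` type, and `routeReply` itself are hypothetical, and the actual queue names in the INBOX surface may differ.

```typescript
type TriageClass = "WARM" | "NEEDS_HUMAN" | "IGNORE";

// Hypothetical destination names, chosen for illustration only.
type Route = "warm_queue" | "human_review" | "ignore";

interface Prediction {
  class: TriageClass;
  confidence: number;
}

// Threshold stated in the text: anything below 0.65 goes to a human.
const ESCALATION_THRESHOLD = 0.65;

function routeReply(p: Prediction, threshold: number = ESCALATION_THRESHOLD): Route {
  // The confidence check runs first, so a low-confidence WARM or IGNORE
  // prediction is escalated regardless of the predicted class.
  if (p.confidence < threshold) return "human_review";

  switch (p.class) {
    case "WARM":
      return "warm_queue";
    case "IGNORE":
      return "ignore";
    case "NEEDS_HUMAN":
      return "human_review";
  }
}
```

For example, `routeReply({ class: "WARM", confidence: 0.6 })` yields `"human_review"`, while the same class at confidence 0.9 yields `"warm_queue"`.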
Weekly disagreement sampling
We sample n=20 model–human disagreements per week for qualitative review. The dominant failure mode is hedged-affirmative language (e.g. 'maybe next quarter'), which directly informs the next training iteration.
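One way to draw the weekly sample is a uniform random pick from that week's disagreements. The sketch below assumes each stored reply carries both the model label and the human label; the `LabelledReply` shape and `sampleDisagreements` are hypothetical names, and the source does not say how its sample is actually drawn.

```typescript
// Hypothetical record shape for a reply that has both labels attached.
interface LabelledReply {
  id: string;
  model: string;
  human: string;
}

// Draw up to n disagreements uniformly at random, without replacement.
// The rng parameter is injectable so the sampling can be made reproducible.
function sampleDisagreements(
  replies: LabelledReply[],
  n: number = 20,
  rng: () => number = Math.random,
): LabelledReply[] {
  // filter() returns a new array, so the caller's input is not mutated.
  const pool = replies.filter((r) => r.model !== r.human);
  const k = Math.min(n, pool.length);

  // Partial Fisher-Yates shuffle: only the first k slots need randomising.
  for (let i = 0; i < k; i++) {
    const j = i + Math.floor(rng() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, k);
}
```

If a week produces fewer than `n` disagreements, the function returns all of them rather than padding the sample.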
- Agent and human agree: 87%
- Agent leaned warm, human said 'needs a human': 10%
- Other disagreement: 3%
- Top-1 agreement with the human triager: 87% on n=1,200 — operationally sufficient given the confidence-based escalation rule.
- Residual disagreements are concentrated in hedged-affirmative language ('not now but later'). A targeted fix is in development.
- Calibration at the upper tail is well-behaved: predictions with stated confidence ≥0.90 achieve empirical accuracy of 94%.
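The upper-tail calibration figure is the kind of number that can be recomputed from the labelled store with a few lines. The sketch below is an assumed way of doing that, not the evaluation code behind the results above; `ScoredExample` and `accuracyAtOrAbove` are hypothetical names.

```typescript
// Hypothetical record: one prediction paired with the human triager's label.
interface ScoredExample {
  confidence: number;
  predicted: string;
  human: string;
}

// Empirical accuracy over the predictions whose stated confidence is at
// or above minConfidence. Returns null when no prediction clears the bar.
function accuracyAtOrAbove(examples: ScoredExample[], minConfidence: number): number | null {
  const tail = examples.filter((e) => e.confidence >= minConfidence);
  if (tail.length === 0) return null;
  const correct = tail.filter((e) => e.predicted === e.human).length;
  return correct / tail.length;
}
```

Calling it with `minConfidence = 0` gives plain top-1 agreement over the whole set, and calling it with `0.9` gives the upper-tail figure. A large gap between stated confidence and empirical accuracy in any band would suggest the escalation threshold needs revisiting.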
At the individual-operator level the classifier reduces an ~200-message review window to ~60. At the team level the more material effect is consistency: variance across operators collapses, and the warm-pipeline classification becomes self-consistent across the AI-agent cohort.
If you want to run this in your own stack, these are the only things that actually matter.
Author the label schema before any modelling
If a class cannot be characterised by two exemplars and two counter-exemplars, the class is under-specified and no downstream model will compensate.
Always require a rationale field
The rationale field primarily serves reviewers but is also associated with measurable gains in classification quality.
Treat confidence as the primary output
The predicted class is a prior; the confidence score is the operative decision variable that determines whether a human reviews the reply.
- [1] Internal: INBOX ops surface spec
- [2] Field notes: hedged-warm replies at enso (2026)








