Inbox Triage Classifier: 87% Top-1 Agreement With Human Triager on n=1,200 Replies

FILE α-02 · INBOX · ops · May 06, 2026

Dani Shvarts — enso Lab, INBOX group

§1 · Abstract

We evaluated an LLM-based triage classifier that partitions inbound replies into three classes (warm, needs-human, ignore) and emits a confidence score on each call. On a labelled set of n=1,200 replies it achieves 87% top-1 agreement with our human triager, and predictions with stated confidence ≥0.90 agree with the human 94% of the time. Error analysis shows that residual disagreement is dominated by hedged-affirmative language, a tractable failure mode.

Status: internal use only
Clearance: α-02
Surface: INBOX · ops
Read: 3 min

Stack: 5 components

GPT-4o-mini: bucket + confidence calls
Zod: structured-output schema
Gmail API: inbound reply ingestion
Supabase: labelled-reply store + audit log
Slack: needs-human routing

§2 · Hypothesis

H1: A small classifier can reliably separate intent-positive replies from soft-decline and opt-out replies with a low false-negative rate. If H1 holds, the operator needs to inspect only ~1/3 of the inbox, at no cost to recall on warm intent.

§3 · Materials

Corpus: n=1,200 inbound replies sampled from inboxes operated by enso AI agents
Label schema: three mutually exclusive classes (WARM, NEEDS_HUMAN, IGNORE)
Per-call confidence score used as the operative routing signal
Validation: held-out set of n=200 replies excluded from tuning

§4 · Procedure

The end-to-end recipe. Follow it top to bottom; each step assumes the previous one ran cleanly.

Specify the label schema in writing
Prior to any model work we authored a one-paragraph operational definition for each class, accompanied by two exemplars and two counter-exemplars. Label ambiguity is the primary upstream determinant of classifier quality; under-specified classes propagate directly into routing error.
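A minimal sketch of how the written schema could be checked for completeness before any model work, assuming the class definitions are kept as structured records. The `ClassSpec` shape, its field names, and `findSpecGaps` are hypothetical and not taken from the enso codebase.

```typescript
// Hypothetical record for one written class definition; field names are illustrative.
type TriageClass = "WARM" | "NEEDS_HUMAN" | "IGNORE";

interface ClassSpec {
  label: TriageClass;
  definition: string;         // one-paragraph operational definition
  exemplars: string[];        // replies that belong to the class
  counterExemplars: string[]; // near-miss replies that do not
}

// Returns human-readable problems; an empty list means every class is fully specified.
function findSpecGaps(specs: ClassSpec[]): string[] {
  const required: TriageClass[] = ["WARM", "NEEDS_HUMAN", "IGNORE"];
  const problems: string[] = [];
  for (const label of required) {
    const matches = specs.filter((s) => s.label === label);
    const spec = matches[0];
    if (matches.length !== 1 || spec === undefined) {
      problems.push(`${label}: expected exactly one spec, found ${matches.length}`);
      continue;
    }
    if (spec.definition.trim().length === 0) {
      problems.push(`${label}: missing definition`);
    }
    if (spec.exemplars.length < 2) {
      problems.push(`${label}: fewer than two exemplars`);
    }
    if (spec.counterExemplars.length < 2) {
      problems.push(`${label}: fewer than two counter-exemplars`);
    }
  }
  return problems;
}
```

Running a check like this before tuning makes an under-specified class fail loudly instead of surfacing later as routing error.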
Enforce a structured response schema
Outputs are constrained to a Zod-validated schema with three fields: class, confidence, and a one-sentence rationale. The rationale is primarily a diagnostic for reviewers, but empirically it also improves classification quality, which is consistent with chain-of-thought elicitation effects.
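The sketch below illustrates the three-field output contract. It is a dependency-free stand-in written in plain TypeScript, not the Zod schema used in the experiment; the `TriageResult` type, the `parseTriageResult` helper, and the assumption that confidence lies in [0, 1] are all illustrative.

```typescript
type TriageClass = "WARM" | "NEEDS_HUMAN" | "IGNORE";

interface TriageResult {
  class: TriageClass;
  confidence: number; // assumed to lie in [0, 1]
  rationale: string;  // one-sentence explanation for reviewers
}

const CLASSES: readonly string[] = ["WARM", "NEEDS_HUMAN", "IGNORE"];

// Parses raw model output. Returns null instead of throwing so the caller can
// treat malformed output as a routing decision (for example, escalate it).
function parseTriageResult(raw: string): TriageResult | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof data !== "object" || data === null) return null;
  const rec = data as Record<string, unknown>;
  const cls = rec["class"];
  const confidence = rec["confidence"];
  const rationale = rec["rationale"];
  if (typeof cls !== "string" || CLASSES.indexOf(cls) === -1) return null;
  if (typeof confidence !== "number" || !(confidence >= 0 && confidence <= 1)) return null;
  if (typeof rationale !== "string" || rationale.trim() === "") return null;
  return { class: cls as TriageClass, confidence, rationale };
}
```

Rejecting an empty rationale, rather than treating the field as optional, keeps the reviewer diagnostic present on every accepted call.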
Route on confidence, not on predicted class
Any prediction with confidence <0.65 is escalated to a human reviewer irrespective of predicted class. The predicted class is a prior; the confidence score is the operative decision variable.
Fig. How a reply moves through the agent
1. Reply arrives: inbound thread
2. Agent reads it: bucket + confidence
3. Confidence < 65%? → human
4. Route or queue: warm / ignore
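The routing rule can be sketched as a single function in which the confidence check runs before the predicted class is consulted. The 0.65 threshold comes from the step above; the `Route` names and the `route` function are hypothetical.

```typescript
type TriageClass = "WARM" | "NEEDS_HUMAN" | "IGNORE";
type Route = "warm_queue" | "human_review" | "ignore"; // illustrative destination names

const ESCALATION_THRESHOLD = 0.65; // below this, a human reviews the reply

function route(predicted: TriageClass, confidence: number): Route {
  // Confidence is checked first, so a low-confidence WARM or IGNORE
  // prediction never bypasses the human reviewer.
  if (confidence < ESCALATION_THRESHOLD) return "human_review";
  if (predicted === "WARM") return "warm_queue";
  if (predicted === "IGNORE") return "ignore";
  return "human_review"; // NEEDS_HUMAN at any confidence
}
```

For example, `route("WARM", 0.64)` yields `"human_review"`, while `route("WARM", 0.65)` yields `"warm_queue"`, because the comparison is strict.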
Weekly disagreement sampling
We sample n=20 model–human disagreements per week for qualitative review. The dominant failure mode is hedged-affirmative language (e.g. 'maybe next quarter'), which directly informs the next training iteration.
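One way to draw the weekly sample is a uniform draw without replacement over the week's disagreements. The sketch below assumes each labelled reply carries both the model's and the human's class; the `LabelledReply` shape and `sampleDisagreements` are hypothetical, and the source does not say how the weekly sample is actually drawn.

```typescript
interface LabelledReply {
  id: string;
  modelClass: string;
  humanClass: string;
}

// Draws up to `n` model-human disagreements uniformly at random, without replacement.
// `rng` is injectable so a draw can be reproduced; it defaults to Math.random.
function sampleDisagreements(
  replies: LabelledReply[],
  n: number = 20,
  rng: () => number = Math.random,
): LabelledReply[] {
  // filter() returns a new array, so the caller's input is not reordered.
  const pool = replies.filter((r) => r.modelClass !== r.humanClass);
  const k = Math.min(n, pool.length);
  // Partial Fisher-Yates shuffle: only the first k positions need to be settled.
  for (let i = 0; i < k; i++) {
    const j = i + Math.floor(rng() * (pool.length - i));
    const tmp = pool[i]!;
    pool[i] = pool[j]!;
    pool[j] = tmp;
  }
  return pool.slice(0, k);
}
```

If a week produces fewer than 20 disagreements, the function returns all of them rather than padding the sample.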

§5 · Results

Fig. Agent vs. human triager, 1,200 replies

Agent and human agree: 87%
Agent leaned warm, human said 'needs a human': 10%
Other disagreement: 3%

Summary: 87% human agreement · 1,200 labelled replies · 3 triage buckets · 94% agreement when 'very confident'

Top-1 agreement with the human triager: 87% on n=1,200, which is operationally sufficient given the confidence-based escalation rule.
Residual disagreements are concentrated in hedged-affirmative language ('not now but later'). A targeted fix is in development.
Calibration at the upper tail is well-behaved: predictions with stated confidence ≥0.90 achieve empirical accuracy of 94%.
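Both headline figures are agreement rates over a slice of the labelled set: all replies for the top-1 figure, and replies with confidence ≥0.90 for the upper-tail figure. A minimal sketch of that computation follows; the `ScoredReply` shape and `agreementRate` are hypothetical, and no data from the experiment is reproduced here.

```typescript
interface ScoredReply {
  modelClass: string;
  humanClass: string;
  confidence: number;
}

// Share of replies on which model and human agree, restricted to
// confidence >= minConfidence. Returns null when the slice is empty,
// to avoid reporting 0/0 as a rate.
function agreementRate(replies: ScoredReply[], minConfidence: number = 0): number | null {
  const slice = replies.filter((r) => r.confidence >= minConfidence);
  if (slice.length === 0) return null;
  const agreed = slice.filter((r) => r.modelClass === r.humanClass).length;
  return agreed / slice.length;
}
```

Calling `agreementRate(replies)` gives the overall figure and `agreementRate(replies, 0.9)` the upper-tail figure. Reporting the slice size alongside each rate shows how many replies the upper-tail estimate rests on.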

§6 · Discussion

At the individual-operator level the classifier reduces an ~200-message review window to ~60. At the team level the more material effect is consistency: variance across operators collapses, and the warm-pipeline classification becomes self-consistent across the AI-agent cohort.

§7 · Reproduce it yourself

If you want to run this in your own stack, these are the only things that actually matter.

Author the label schema before any modelling
If a class cannot be characterised by two exemplars and two counter-exemplars, the class is under-specified and no downstream model will compensate.
Always require a rationale field
The rationale field primarily serves reviewers but is also associated with measurable gains in classification quality.
Treat confidence as the primary output
The predicted class is a prior; the confidence score is the operative decision variable that determines whether a human reviews the reply.


Filed by Dani Shvarts
