Designing a Human-in-the-Loop Review Pipeline
Building a pipeline where AI generates first drafts and humans review before content reaches users — job queues, confidence thresholds, and feedback loops.
Context
We added an AI feature that generated draft content (summaries, categorizations) for user review. The AI wasn't accurate enough to ship directly; humans had to approve or edit before anything went live. The challenge was designing a pipeline that managed the queue, routed work to reviewers, auto-approved high-confidence outputs, and fed corrections back so we could improve the model over time.
Constraints
- Reviewers are a bottleneck — we needed to auto-approve when confidence was high enough
- Feedback loop — rejections and edits had to be captured for future model improvement
- Multi-tenant — each tenant might have different review workflows and thresholds
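The multi-tenant constraint can be illustrated as a per-tenant settings lookup with platform defaults. This is a sketch; the tenant names, keys, and values are hypothetical (our real settings lived in the database), but the merge-over-defaults pattern is the one we used:

```python
# Hypothetical per-tenant overrides; the real values were stored per tenant
# in the database, not hardcoded.
TENANT_SETTINGS: dict[str, dict] = {
    "acme": {"auto_approve_threshold": 0.95, "require_second_reviewer": True},
    "globex": {"auto_approve_threshold": 0.85, "require_second_reviewer": False},
}

# Platform-wide defaults applied when a tenant has no override.
DEFAULTS: dict = {"auto_approve_threshold": 0.9, "require_second_reviewer": False}


def tenant_settings(tenant_id: str) -> dict:
    """Merge a tenant's overrides onto the platform defaults."""
    return {**DEFAULTS, **TENANT_SETTINGS.get(tenant_id, {})}
```

Tenants without overrides silently fall back to the defaults, so adding a new tenant requires no configuration up front.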
Architecture
We built a job queue (PostgreSQL-backed, with a simple polling worker) where each AI-generated item has a state: pending_review, approved, rejected, needs_edit. The AI service writes to the queue with a confidence score. A configurable threshold per tenant determines auto-approval — above 0.9 we auto-approve, below that it goes to the review queue.

Reviewers get a dashboard that shows pending items, lets them approve/reject/edit, and captures the delta when they make changes. We store (original_output, human_edit) pairs for later analysis and potential fine-tuning. The key was making the threshold configurable — some tenants wanted stricter review; others were comfortable with higher auto-approval.
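The routing decision above can be sketched as follows. The names (`ReviewState`, `TenantConfig`, `route_item`) are illustrative, not the production code, and whether the threshold comparison is strict or inclusive is an assumption:

```python
from dataclasses import dataclass
from enum import Enum


class ReviewState(Enum):
    """States an AI-generated item moves through in the queue."""
    PENDING_REVIEW = "pending_review"
    APPROVED = "approved"
    REJECTED = "rejected"
    NEEDS_EDIT = "needs_edit"


@dataclass
class TenantConfig:
    # Per-tenant auto-approval threshold; stricter tenants raise this
    # (or set it above 1.0 to disable auto-approval entirely).
    auto_approve_threshold: float = 0.9


def route_item(confidence: float, config: TenantConfig) -> ReviewState:
    """Decide whether an AI output ships directly or waits for human review."""
    if confidence >= config.auto_approve_threshold:
        return ReviewState.APPROVED
    return ReviewState.PENDING_REVIEW
```

In production this decision ran when the AI service wrote to the queue, so auto-approved items never appeared on the reviewer dashboard at all.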
Alternatives considered
- Fully automated — ship AI output directly: Quality wasn't good enough. One bad summary in front of a client would have damaged trust.
- Fully manual — no auto-approval: Would have created a backlog. Most outputs were fine; we only needed human review for the edge cases.
Lessons learned
- Confidence thresholds are a product decision, not just engineering. We let tenants tune theirs based on their risk tolerance.
- Capture feedback even when you're not sure how you'll use it. The (original, edited) pairs became valuable for prompt iteration.
- The review UI matters. A clunky dashboard would have made reviewers avoid the queue; we invested in making it fast and clear.
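Capturing the (original, edited) pairs mentioned above can be as simple as storing both strings alongside a line-level diff. A minimal sketch using the standard library's difflib — the record shape here is assumed, not our actual schema:

```python
import difflib
from dataclasses import dataclass, field


@dataclass
class FeedbackRecord:
    """One reviewer correction: the AI draft, the human edit, and their delta."""
    original_output: str
    human_edit: str
    diff: list[str] = field(default_factory=list)


def capture_feedback(original: str, edited: str) -> FeedbackRecord:
    """Store the (original_output, human_edit) pair with a unified diff."""
    diff = list(difflib.unified_diff(
        original.splitlines(),
        edited.splitlines(),
        fromfile="ai_draft",
        tofile="human_edit",
        lineterm="",
    ))
    return FeedbackRecord(original_output=original, human_edit=edited, diff=diff)
```

Keeping the full strings (not just the diff) is deliberate: prompt iteration and fine-tuning both want complete before/after examples, while the diff is handy for quickly eyeballing what reviewers actually change.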