Designing a Human-in-the-Loop Review Pipeline
Building a pipeline where AI generates first drafts and humans review before content reaches users — job queues, confidence thresholds, and feedback loops.
Context
We added an AI feature that generated draft content (summaries, categorizations) for user review. The AI wasn't accurate enough to ship directly; humans had to approve or edit before anything went live. The challenge was designing a pipeline that managed the queue, routed work to reviewers, auto-approved high-confidence outputs, and fed corrections back so we could improve the model over time.
Constraints
- Reviewers are a bottleneck — we needed to auto-approve when confidence was high enough
- Feedback loop — rejections and edits had to be captured for future model improvement
- Multi-tenant — each tenant might have different review workflows and thresholds
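The multi-tenant constraint can be illustrated as a per-tenant settings lookup with platform defaults. This is a sketch; the tenant names, keys, and values are hypothetical (our real settings lived in the database), but the merge-over-defaults pattern is the one we used:

```python
# Hypothetical per-tenant overrides; the real values were stored per tenant
# in the database, not hardcoded.
TENANT_SETTINGS: dict[str, dict] = {
    "acme": {"auto_approve_threshold": 0.95, "require_second_reviewer": True},
    "globex": {"auto_approve_threshold": 0.85, "require_second_reviewer": False},
}

# Platform-wide defaults applied when a tenant has no override.
DEFAULTS: dict = {"auto_approve_threshold": 0.9, "require_second_reviewer": False}


def tenant_settings(tenant_id: str) -> dict:
    """Merge a tenant's overrides onto the platform defaults."""
    return {**DEFAULTS, **TENANT_SETTINGS.get(tenant_id, {})}
```

Tenants without overrides silently fall back to the defaults, so adding a new tenant requires no configuration up front.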
Architecture
We built a job queue (PostgreSQL-backed, with a simple polling worker) where each AI-generated item has a state: pending_review, approved, rejected, needs_edit. The AI service writes to the queue with a confidence score. A configurable threshold per tenant determines auto-approval — above 0.9 we auto-approve, below that it goes to the review queue.

Reviewers get a dashboard that shows pending items, lets them approve/reject/edit, and captures the delta when they make changes. We store (original_output, human_edit) pairs for later analysis and potential fine-tuning. The key was making the threshold configurable — some tenants wanted stricter review; others were comfortable with higher auto-approval.
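The routing decision above can be sketched as follows. The names (`ReviewState`, `TenantConfig`, `route_item`) are illustrative, not the production code, and whether the threshold comparison is strict or inclusive is an assumption:

```python
from dataclasses import dataclass
from enum import Enum


class ReviewState(Enum):
    """States an AI-generated item moves through in the queue."""
    PENDING_REVIEW = "pending_review"
    APPROVED = "approved"
    REJECTED = "rejected"
    NEEDS_EDIT = "needs_edit"


@dataclass
class TenantConfig:
    # Per-tenant auto-approval threshold; stricter tenants raise this
    # (or set it above 1.0 to disable auto-approval entirely).
    auto_approve_threshold: float = 0.9


def route_item(confidence: float, config: TenantConfig) -> ReviewState:
    """Decide whether an AI output ships directly or waits for human review."""
    if confidence >= config.auto_approve_threshold:
        return ReviewState.APPROVED
    return ReviewState.PENDING_REVIEW
```

In production this decision ran when the AI service wrote to the queue, so auto-approved items never appeared on the reviewer dashboard at all.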
Alternatives considered
- Fully automated — ship AI output directly: Quality wasn't good enough. One bad summary in front of a client would have damaged trust.
- Fully manual — no auto-approval: Would have created a backlog. Most outputs were fine; we only needed human review for the edge cases.
Lessons learned
- Confidence thresholds are a product decision, not just engineering. We let tenants tune theirs based on their risk tolerance.
- Capture feedback even when you're not sure how you'll use it. The (original, edited) pairs became valuable for prompt iteration.
- The review UI matters. A clunky dashboard would have made reviewers avoid the queue; we invested in making it fast and clear.
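Capturing the (original, edited) pairs mentioned above can be as simple as storing both strings alongside a line-level diff. A minimal sketch using the standard library's difflib — the record shape here is assumed, not our actual schema:

```python
import difflib
from dataclasses import dataclass, field


@dataclass
class FeedbackRecord:
    """One reviewer correction: the AI draft, the human edit, and their delta."""
    original_output: str
    human_edit: str
    diff: list[str] = field(default_factory=list)


def capture_feedback(original: str, edited: str) -> FeedbackRecord:
    """Store the (original_output, human_edit) pair with a unified diff."""
    diff = list(difflib.unified_diff(
        original.splitlines(),
        edited.splitlines(),
        fromfile="ai_draft",
        tofile="human_edit",
        lineterm="",
    ))
    return FeedbackRecord(original_output=original, human_edit=edited, diff=diff)
```

Keeping the full strings (not just the diff) is deliberate: prompt iteration and fine-tuning both want complete before/after examples, while the diff is handy for quickly eyeballing what reviewers actually change.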