Human Annotation

Score and label your AI agent outputs through collaborative annotation queues with your team.

Human Annotation is a manual evaluation method where team members review and score traces, sessions, and observations. Use it to establish quality baselines, catch edge cases automated evaluators miss, and build labeled datasets for training.

Why Use Human Annotation

  • Establish quality baselines — create human-scored benchmarks to calibrate your automated evaluators against
  • Catch what automation misses — review nuanced outputs where automated scores fall short (tone, factual accuracy, safety)
  • Collaborative review — distribute annotation work across your team with managed queues and progress tracking
  • Consistent labeling — standardized score configurations ensure every reviewer uses the same criteria
  • Feed evaluation loops — annotation scores flow into your experiment comparisons and model governance dashboards

Getting Started

Step 1: Create Score Configurations

Before annotating, define what you're scoring. Navigate to Project Settings > Score Configs and create your scoring criteria.

Score types available:

  • Numeric — e.g., a 1-5 rating; use for quality, relevance, helpfulness
  • Categorical — e.g., Good / Bad / Neutral; use for quick triage, sentiment
  • Boolean — e.g., Yes / No; use for factual correctness, safety pass/fail
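A score value should match its configuration before it is saved. The sketch below uses a hypothetical config shape (the platform's actual schema may differ) to show how the three score types constrain values:

```python
# Hypothetical score-config shapes -- illustrative, not the platform's schema.
SCORE_CONFIGS = {
    "quality": {"type": "numeric", "min": 1, "max": 5},
    "triage": {"type": "categorical", "options": ["Good", "Bad", "Neutral"]},
    "is_safe": {"type": "boolean"},
}

def validate_score(config_name: str, value) -> bool:
    """Check a score value against its configuration before submitting it."""
    cfg = SCORE_CONFIGS[config_name]
    if cfg["type"] == "numeric":
        return isinstance(value, (int, float)) and cfg["min"] <= value <= cfg["max"]
    if cfg["type"] == "categorical":
        return value in cfg["options"]
    if cfg["type"] == "boolean":
        return isinstance(value, bool)
    return False
```

For example, `validate_score("quality", 6)` fails because 6 is outside the configured 1-5 range, while `validate_score("triage", "Good")` passes.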

Step 2: Create an Annotation Queue

Navigate to Human Annotation in the sidebar and click Create Annotation Queue.

Configure your queue:

  • Name — descriptive name (e.g., "Weekly QA Review", "Safety Audit")
  • Score configs — select which scores annotators will fill in
  • Description — instructions for reviewers on how to score

Step 3: Add Items to the Queue

Populate your annotation queue with traces to review:

  • From traces table — select traces and add them to a queue
  • Automatically — configure filters to auto-populate queues based on criteria (e.g., low confidence scores, error traces)
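The auto-populate criteria above can be sketched as a filter over traces. The field names (`status`, `confidence_score`) and the threshold are illustrative assumptions, not the platform's actual filter schema:

```python
def should_enqueue(trace: dict,
                   max_confidence: float = 0.6,
                   include_errors: bool = True) -> bool:
    """Decide whether a trace belongs in the review queue.

    Mirrors the example criteria: low automated confidence or an error status.
    Field names are hypothetical.
    """
    if include_errors and trace.get("status") == "error":
        return True
    confidence = trace.get("confidence_score")
    return confidence is not None and confidence < max_confidence

traces = [
    {"id": "t1", "status": "ok", "confidence_score": 0.35},
    {"id": "t2", "status": "error", "confidence_score": 0.9},
    {"id": "t3", "status": "ok", "confidence_score": 0.95},
]
queue_items = [t["id"] for t in traces if should_enqueue(t)]  # t1 (low confidence), t2 (error)
```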

Step 4: Annotate

Team members open the annotation queue and work through items one by one:

  1. Review the full trace context — input, output, intermediate steps
  2. See any existing automated scores for reference
  3. Apply scores based on the configured criteria
  4. Move to the next item

The annotation interface shows all relevant context so reviewers can make informed judgments without switching between views.

Annotation Queues

How They Work

Annotation queues organize the review workload:

  • Each queue has a set of score configurations that annotators fill in
  • Items are presented one at a time for focused review
  • Progress tracking shows how many items are reviewed vs. remaining
  • Multiple team members can work on the same queue simultaneously
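The queue mechanics above can be modeled as a small in-memory sketch. This is purely illustrative — the platform manages queues server-side — but it captures the one-at-a-time presentation and progress tracking:

```python
from dataclasses import dataclass, field

@dataclass
class AnnotationQueue:
    """Minimal in-memory model of an annotation queue (illustrative only)."""
    name: str
    score_configs: list
    items: list = field(default_factory=list)
    reviewed: set = field(default_factory=set)

    def next_item(self):
        """Return the next unreviewed item, presented one at a time."""
        for item in self.items:
            if item["id"] not in self.reviewed:
                return item
        return None  # queue complete

    def mark_reviewed(self, item_id: str) -> None:
        self.reviewed.add(item_id)

    def progress(self) -> str:
        """Reviewed vs. total, as shown in progress tracking."""
        return f"{len(self.reviewed)}/{len(self.items)} reviewed"
```

A reviewer loop would repeatedly call `next_item()`, apply scores, then `mark_reviewed()` until `next_item()` returns `None`.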

Queue Management

  • Create queue — define name, description, and score configs
  • Add items — populate from traces, observations, or sessions
  • Assign reviewers — invite team members to annotate
  • Track progress — monitor completion rate and annotation quality

Scoring Traces Directly

You can also score individual traces without using queues:

  1. Open any trace in the platform
  2. Click the Annotate button
  3. Select a score configuration
  4. Enter your score value
  5. The score appears in the trace's Scores tab

This is useful for ad-hoc reviews or when you spot something during normal trace exploration.

Using Annotation Data

In Experiments

When comparing experiment runs, annotation scores appear alongside automated scores. This lets you validate whether your automated evaluators agree with human judgment.

As Evaluation Baselines

Use annotation scores to:

  • Calibrate LLM-as-Judge evaluators against human preferences
  • Identify where automated scores diverge from human assessment
  • Build training data for custom evaluation models
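A simple way to find where automated scores diverge from human assessment is to flag items whose scores differ by more than a threshold. This sketch assumes both sets of scores share one numeric scale (e.g., 1-5):

```python
def divergence_report(human: dict, automated: dict, threshold: float = 1.0) -> list:
    """Flag items where the automated score diverges from the human score.

    Assumes both score sets use the same numeric scale.
    Returns (item_id, human_score, automated_score) tuples.
    """
    flagged = []
    for item_id, h in human.items():
        a = automated.get(item_id)
        if a is not None and abs(h - a) >= threshold:
            flagged.append((item_id, h, a))
    return flagged

human_scores = {"t1": 5, "t2": 2, "t3": 4}
judge_scores = {"t1": 5, "t2": 4, "t3": 3.5}
flagged = divergence_report(human_scores, judge_scores)  # t2 diverges by 2 points
```

Flagged items are good candidates for revising your LLM-as-Judge prompt or adding few-shot examples drawn from the human annotations.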

In Dashboards

Annotation scores flow into your project dashboards and can be filtered, aggregated, and tracked over time — giving you a human-quality signal alongside your automated metrics.

Best Practices

Define Clear Scoring Guidelines

Write explicit instructions for each score configuration:

  • What does a score of 5 vs. 1 mean?
  • What edge cases should reviewers watch for?
  • When should a reviewer skip vs. flag an item?

Start Small

Begin with a focused queue (50-100 items) on a specific quality dimension. Validate that the scoring criteria are clear before scaling up.

Calibrate Across Reviewers

Have multiple reviewers score the same items initially to check inter-annotator agreement. Adjust guidelines where disagreement is high.
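Inter-annotator agreement can be quantified with Cohen's kappa, which corrects raw agreement for what two reviewers would agree on by chance. A minimal sketch for two reviewers scoring the same items:

```python
from collections import Counter

def cohens_kappa(ratings_a: list, ratings_b: list) -> float:
    """Cohen's kappa for two reviewers over the same items.

    1.0 = perfect agreement; 0.0 = no better than chance.
    """
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(ratings_a) | set(ratings_b)
    # Chance agreement from each reviewer's label distribution.
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

reviewer_1 = ["Good", "Good", "Bad", "Neutral", "Good"]
reviewer_2 = ["Good", "Bad", "Bad", "Neutral", "Good"]
kappa = cohens_kappa(reviewer_1, reviewer_2)  # ~0.69: substantial agreement
```

As a rough rule of thumb, kappa below ~0.6 suggests the scoring guidelines need tightening before you scale the queue up.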

Next Steps