
Datasets

Create and manage collections of test inputs and expected outputs for benchmarking, evaluating, and improving your AI agents.

Datasets are structured collections of inputs (and optionally expected outputs) that you use to test your AI application against known scenarios — from production edge cases to synthetic benchmarks.

Why Use Datasets

  • Benchmark before deploying — test new model versions or prompt changes against a known set of inputs before shipping
  • Capture production edge cases — save real traces that caused issues as test cases for regression testing
  • Structured evaluation — run experiments across consistent inputs to compare models, prompts, or configurations
  • Collaborative curation — your team can build and refine datasets together through the UI or SDKs
  • Custom workflows — use datasets via the API for fine-tuning, few-shot prompting, or automated CI testing
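
As a sketch of the few-shot use case above: items pulled from a dataset can seed a prompt directly. This is illustrative plain Python — the dicts mirror the dataset item shape (input / expected_output) rather than objects returned by the SDK, and `build_few_shot_prompt` is a hypothetical helper:

```python
# Illustrative only: items are plain dicts shaped like dataset items,
# not objects returned by the SDK.
items = [
    {"input": {"question": "How do I reset my password?"},
     "expected_output": {"answer": "Go to Settings > Security > Reset Password"}},
    {"input": {"question": "How do I change my email?"},
     "expected_output": {"answer": "Go to Settings > Account > Email"}},
]

def build_few_shot_prompt(items, question):
    """Turn dataset items into Q/A exemplars followed by the new question."""
    shots = [
        f"Q: {it['input']['question']}\nA: {it['expected_output']['answer']}"
        for it in items
        if it.get("expected_output")  # skip items without a reference answer
    ]
    return "\n\n".join(shots + [f"Q: {question}\nA:"])

prompt = build_few_shot_prompt(items, "How do I delete my account?")
```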

Getting Started

Step 1: Create a Dataset

Navigate to Datasets in your project sidebar and click Create Dataset. Provide a name and optional description.

You can also create datasets programmatically:

from ants_platform import AntsPlatform
 
ants = AntsPlatform()
 
dataset = ants.create_dataset(
    name="customer-support-qa",
    description="Common customer support questions with expected responses",
)

Step 2: Add Items

Each dataset item consists of an input (required) and an optional expected output. Add items through:

  • UI — click "Add Item" in the dataset view
  • SDK — create items programmatically
  • CSV import — bulk upload from spreadsheets
  • From traces — save production traces directly as dataset items

For example, adding an item via the SDK:

ants.create_dataset_item(
    dataset_name="customer-support-qa",
    input={"question": "How do I reset my password?"},
    expected_output={"answer": "Go to Settings > Security > Reset Password"},
)

Step 3: Run Experiments

Use your dataset to benchmark different configurations. Each experiment run evaluates your application against every item in the dataset and records the results.

Navigate to the dataset and click New Run, or trigger runs via the SDK for automated testing pipelines.
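
To make the mechanics concrete, here is a plain-Python sketch of what a run does: evaluate your application against every item and record output, score, and latency. The `run_experiment` and `exact_match` names are illustrative, not the SDK's run API:

```python
import time

def exact_match(output, expected):
    """Simplest possible evaluator: 1.0 on exact equality, else 0.0."""
    return 1.0 if output == expected else 0.0

def run_experiment(app, items, scorer=exact_match):
    """Evaluate `app` on every dataset item, recording per-item results."""
    results = []
    for item in items:
        start = time.perf_counter()
        output = app(item["input"])
        latency = time.perf_counter() - start
        results.append({
            "input": item["input"],
            "output": output,
            "score": scorer(output, item.get("expected_output")),
            "latency_s": latency,
        })
    return results

def app(inp):
    # Stub standing in for your real agent or model call
    return {"answer": "Go to Settings > Security > Reset Password"}

items = [{"input": {"question": "How do I reset my password?"},
          "expected_output": {"answer": "Go to Settings > Security > Reset Password"}}]
results = run_experiment(app, items)
```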

Step 4: Compare Results

View experiment runs side-by-side in the dataset dashboard. Compare scores, latency, cost, and output quality across different model versions or prompt configurations.

Dataset Items

Structure

Each dataset item contains:

Field            Required  Description
input            Yes       The input to your application (JSON)
expected_output  No        The expected/ideal output for evaluation
metadata         No        Additional context (tags, categories, difficulty level)
source_trace_id  No        Link back to the production trace this item was created from

Adding Items from Production

When reviewing traces in the platform, you can save any trace as a dataset item with one click. This is the fastest way to build datasets from real-world usage — especially for edge cases and failures.
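
Programmatically, the same conversion amounts to copying the trace's input and output into an item and keeping a source_trace_id back-link. A minimal sketch with plain dicts (the trace shape here is an assumption, not the platform's trace schema):

```python
def trace_to_dataset_item(trace):
    """Convert a production trace into a dataset item, keeping a
    back-link to the trace via source_trace_id (trace shape assumed)."""
    return {
        "input": trace["input"],
        # The observed output becomes the expected output, to refine later
        "expected_output": trace.get("output"),
        "metadata": {"source": "production"},
        "source_trace_id": trace["id"],
    }

trace = {
    "id": "tr_123",
    "input": {"question": "Why was I logged out?"},
    "output": {"answer": "Sessions expire after 30 days."},
}
item = trace_to_dataset_item(trace)
```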

CSV Import

For bulk imports, use the CSV upload feature:

  1. Go to your dataset
  2. Click Import CSV
  3. Map CSV columns to input/expected output fields
  4. Preview and confirm the import
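
If your test cases live in code, you can generate a compatible CSV with the standard library and upload it. The column names below are illustrative — whatever headers you write are what you map to input/expected output fields in step 3:

```python
import csv
import io

rows = [
    {"question": "How do I reset my password?",
     "answer": "Go to Settings > Security > Reset Password"},
    {"question": "How do I change my email?",
     "answer": "Go to Settings > Account > Email"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["question", "answer"])
writer.writeheader()   # the header row is what you map during import
writer.writerows(rows)
csv_text = buf.getvalue()
```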

Experiment Runs

Each experiment run evaluates your application against a dataset and records:

  • Output for each input
  • Scores from evaluators (automated or manual)
  • Latency per item
  • Cost per item
  • Trace links for debugging individual items
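
The item-level records above roll up into the aggregates shown in the runs table. A sketch of that roll-up over plain result dicts (field names are illustrative):

```python
def aggregate_run(results):
    """Roll per-item records up into run-level metrics, like the
    aggregates shown in the runs table."""
    n = len(results)
    return {
        "mean_score": sum(r["score"] for r in results) / n,
        "mean_latency_s": sum(r["latency_s"] for r in results) / n,
        "total_cost": sum(r["cost"] for r in results),
    }

results = [
    {"score": 1.0, "latency_s": 0.8, "cost": 0.002},
    {"score": 0.5, "latency_s": 1.2, "cost": 0.003},
]
summary = aggregate_run(results)
```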

Comparing Runs

The runs table shows aggregate metrics across all runs for a dataset. Click into any run to see item-level results, or compare multiple runs side-by-side.

Organizing Datasets

Use descriptive names with forward slashes to create folder structures:

evaluation/qa-dataset
evaluation/safety-checks
benchmarks/model-comparison-v2
production-edge-cases/auth-failures
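
Because the hierarchy is encoded in the name, you can also group datasets client-side by splitting on the slash. A small sketch:

```python
from collections import defaultdict

names = [
    "evaluation/qa-dataset",
    "evaluation/safety-checks",
    "benchmarks/model-comparison-v2",
    "production-edge-cases/auth-failures",
]

def group_by_folder(names):
    """Group slash-separated dataset names into a folder -> [dataset] map."""
    folders = defaultdict(list)
    for name in names:
        folder, _, leaf = name.partition("/")
        folders[folder].append(leaf)
    return dict(folders)

tree = group_by_folder(names)
```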

API Access

All dataset operations are available via the Python and JavaScript SDKs:

# List datasets
datasets = ants.get_datasets()
 
# Get a specific dataset
dataset = ants.get_dataset("customer-support-qa")
 
# List items
items = dataset.items

See the Python SDK and JS/TS SDK docs for full API reference.

Next Steps