
Datasets

Create and manage collections of test inputs and expected outputs for benchmarking, evaluating, and improving your AI agents.

Datasets are structured collections of inputs (and optionally expected outputs) that you use to test your AI application against known scenarios — from production edge cases to synthetic benchmarks.

Why Use Datasets

  • Benchmark before deploying — test new model versions or prompt changes against a known set of inputs before shipping
  • Capture production edge cases — save real traces that caused issues as test cases for regression testing
  • Structured evaluation — run experiments across consistent inputs to compare models, prompts, or configurations
  • Collaborative curation — your team can build and refine datasets together through the UI or SDKs
  • Custom workflows — use datasets via the API for fine-tuning, few-shot prompting, or automated CI testing
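
As a sketch of the few-shot use case above: items pulled from a dataset can seed a prompt directly. This is illustrative plain Python — the dicts mirror the dataset item shape (input / expected_output) rather than objects returned by the SDK, and `build_few_shot_prompt` is a hypothetical helper:

```python
# Illustrative only: items are plain dicts shaped like dataset items,
# not objects returned by the SDK.
items = [
    {"input": {"question": "How do I reset my password?"},
     "expected_output": {"answer": "Go to Settings > Security > Reset Password"}},
    {"input": {"question": "How do I change my email?"},
     "expected_output": {"answer": "Go to Settings > Account > Email"}},
]

def build_few_shot_prompt(items, question):
    """Turn dataset items into Q/A exemplars followed by the new question."""
    shots = [
        f"Q: {it['input']['question']}\nA: {it['expected_output']['answer']}"
        for it in items
        if it.get("expected_output")  # skip items without a reference answer
    ]
    return "\n\n".join(shots + [f"Q: {question}\nA:"])

prompt = build_few_shot_prompt(items, "How do I delete my account?")
```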

Getting Started

Step 1: Create a Dataset

Navigate to Datasets in your project sidebar and click Create Dataset. Provide a name and optional description.

You can also create datasets programmatically:

from ants_platform import AntsPlatform
 
ants = AntsPlatform()
 
dataset = ants.create_dataset(
    name="customer-support-qa",
    description="Common customer support questions with expected responses",
)

Step 2: Add Items

Each dataset item consists of an input (required) and an optional expected output. Add items through:

  • UI — click "Add Item" in the dataset view
  • SDK — create items programmatically
  • CSV import — bulk upload from spreadsheets
  • From traces — save production traces directly as dataset items

For example, adding an item via the SDK:

ants.create_dataset_item(
    dataset_name="customer-support-qa",
    input={"question": "How do I reset my password?"},
    expected_output={"answer": "Go to Settings > Security > Reset Password"},
)

Step 3: Run Experiments

Use your dataset to benchmark different configurations. Each experiment run evaluates your application against every item in the dataset and records the results.

Navigate to the dataset and click New Run, or trigger runs via the SDK for automated testing pipelines.
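
To make the mechanics concrete, here is a plain-Python sketch of what a run does: evaluate your application against every item and record output, score, and latency. The `run_experiment` and `exact_match` names are illustrative, not the SDK's run API:

```python
import time

def exact_match(output, expected):
    """Simplest possible evaluator: 1.0 on exact equality, else 0.0."""
    return 1.0 if output == expected else 0.0

def run_experiment(app, items, scorer=exact_match):
    """Evaluate `app` on every dataset item, recording per-item results."""
    results = []
    for item in items:
        start = time.perf_counter()
        output = app(item["input"])
        latency = time.perf_counter() - start
        results.append({
            "input": item["input"],
            "output": output,
            "score": scorer(output, item.get("expected_output")),
            "latency_s": latency,
        })
    return results

def app(inp):
    # Stub standing in for your real agent or model call
    return {"answer": "Go to Settings > Security > Reset Password"}

items = [{"input": {"question": "How do I reset my password?"},
          "expected_output": {"answer": "Go to Settings > Security > Reset Password"}}]
results = run_experiment(app, items)
```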

Step 4: Compare Results

View experiment runs side-by-side in the dataset dashboard. Compare scores, latency, cost, and output quality across different model versions or prompt configurations.

Dataset Items

Structure

Each dataset item contains:

Field            Required  Description
input            Yes       The input to your application (JSON)
expected_output  No        The expected/ideal output for evaluation
metadata         No        Additional context (tags, categories, difficulty level)
source_trace_id  No        Link back to the production trace this item was created from

Adding Items from Production

When reviewing traces in the platform, you can save any trace as a dataset item with one click. This is the fastest way to build datasets from real-world usage — especially for edge cases and failures.
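
Programmatically, the same conversion amounts to copying the trace's input and output into an item and keeping a source_trace_id back-link. A minimal sketch with plain dicts (the trace shape here is an assumption, not the platform's trace schema):

```python
def trace_to_dataset_item(trace):
    """Convert a production trace into a dataset item, keeping a
    back-link to the trace via source_trace_id (trace shape assumed)."""
    return {
        "input": trace["input"],
        # The observed output becomes the expected output, to refine later
        "expected_output": trace.get("output"),
        "metadata": {"source": "production"},
        "source_trace_id": trace["id"],
    }

trace = {
    "id": "tr_123",
    "input": {"question": "Why was I logged out?"},
    "output": {"answer": "Sessions expire after 30 days."},
}
item = trace_to_dataset_item(trace)
```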

CSV Import

For bulk imports, use the CSV upload feature:

  1. Go to your dataset
  2. Click Import CSV
  3. Map CSV columns to input/expected output fields
  4. Preview and confirm the import
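
If your test cases live in code, you can generate a compatible CSV with the standard library and upload it. The column names below are illustrative — whatever headers you write are what you map to input/expected output fields in step 3:

```python
import csv
import io

rows = [
    {"question": "How do I reset my password?",
     "answer": "Go to Settings > Security > Reset Password"},
    {"question": "How do I change my email?",
     "answer": "Go to Settings > Account > Email"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["question", "answer"])
writer.writeheader()   # the header row is what you map during import
writer.writerows(rows)
csv_text = buf.getvalue()
```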

Experiment Runs

Each experiment run evaluates your application against a dataset and records:

  • Output for each input
  • Scores from evaluators (automated or manual)
  • Latency per item
  • Cost per item
  • Trace links for debugging individual items
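
The item-level records above roll up into the aggregates shown in the runs table. A sketch of that roll-up over plain result dicts (field names are illustrative):

```python
def aggregate_run(results):
    """Roll per-item records up into run-level metrics, like the
    aggregates shown in the runs table."""
    n = len(results)
    return {
        "mean_score": sum(r["score"] for r in results) / n,
        "mean_latency_s": sum(r["latency_s"] for r in results) / n,
        "total_cost": sum(r["cost"] for r in results),
    }

results = [
    {"score": 1.0, "latency_s": 0.8, "cost": 0.002},
    {"score": 0.5, "latency_s": 1.2, "cost": 0.003},
]
summary = aggregate_run(results)
```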

Comparing Runs

The runs table shows aggregate metrics across all runs for a dataset. Click into any run to see item-level results, or compare multiple runs side-by-side.

Organizing Datasets

Use descriptive names with forward slashes to create folder structures:

evaluation/qa-dataset
evaluation/safety-checks
benchmarks/model-comparison-v2
production-edge-cases/auth-failures
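
Because the hierarchy is encoded in the name, you can also group datasets client-side by splitting on the slash. A small sketch:

```python
from collections import defaultdict

names = [
    "evaluation/qa-dataset",
    "evaluation/safety-checks",
    "benchmarks/model-comparison-v2",
    "production-edge-cases/auth-failures",
]

def group_by_folder(names):
    """Group slash-separated dataset names into a folder -> [dataset] map."""
    folders = defaultdict(list)
    for name in names:
        folder, _, leaf = name.partition("/")
        folders[folder].append(leaf)
    return dict(folders)

tree = group_by_folder(names)
```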

API Access

All dataset operations are available via the Python and JavaScript SDKs:

# List datasets
datasets = ants.get_datasets()
 
# Get a specific dataset
dataset = ants.get_dataset("customer-support-qa")
 
# List items
items = dataset.items

See the Python SDK and JS/TS SDK docs for full API reference.

Next Steps