API Reference

Core Methods

login(key=None, base_url=None, off=False)[source]

Authenticate with the RagMetrics API.

Parameters:
  • key – Optional API key. Defaults to the RAGMETRICS_API_KEY environment variable.

  • base_url – Optional custom base URL for the API.

  • off – If True, disables logging (default: False).

Returns:

True if login was successful.

Return type:

bool
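
Example - Logging in (a minimal sketch; the key value is a placeholder):

import ragmetrics

# Authenticate with an explicit key
ragmetrics.login(key="your-api-key")

# Or omit the key to fall back to the RAGMETRICS_API_KEY environment variable
ragmetrics.login()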

monitor(client, metadata=None, callback=None)[source]

Wrap an LLM client to automatically log its interactions.

Parameters:
  • client – The LLM client to monitor.

  • metadata – Optional metadata to include with logged traces.

  • callback – Optional callback function for custom processing.

Returns:

Wrapped client that logs interactions.
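
Example - Monitoring an OpenAI client (a minimal sketch; assumes the wrapped client exposes the same chat.completions interface as the original client, and the metadata keys shown are arbitrary):

import ragmetrics
from openai import OpenAI

ragmetrics.login("your-api-key")

# Wrap the client; calls made through it are logged automatically
client = ragmetrics.monitor(OpenAI(), metadata={"env": "dev"})

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)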

trace_function_call(func)[source]

Decorator to trace function execution, used to track retrieval steps in RAG pipelines.

Parameters:

func – The function to trace.

Returns:

The traced function.
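
Example - Tracing a retrieval step (a minimal sketch; assumes the decorator is exposed at the package top level like login and monitor, and the retrieval logic shown is a placeholder):

import ragmetrics

ragmetrics.login("your-api-key")

@ragmetrics.trace_function_call
def retrieve_documents(query):
    # Placeholder retrieval logic for a RAG pipeline
    return ["France is a country in Western Europe. Its capital is Paris."]

contexts = retrieve_documents("What is the capital of France?")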

Classes

Cohort

class Cohort(name, generator_model=None, rag_pipeline=None, system_prompt=None)[source]

Bases: object

A class representing a group of models or pipelines to be evaluated.

A cohort defines a specific configuration to test in an experiment. It can represent either a single model or a RAG pipeline configuration. Cohorts allow comparing different setups against the same dataset and criteria.

__init__(name, generator_model=None, rag_pipeline=None, system_prompt=None)[source]

Initialize a new Cohort instance.

Note: A cohort must include either generator_model OR rag_pipeline, not both.

Example - Creating model cohorts:

# For comparing different models:
cohorts = [
    Cohort(name="GPT-4", generator_model="gpt-4"),
    Cohort(name="Claude 3 Sonnet", generator_model="claude-3-sonnet-20240229"),
    Cohort(name="Llama 3", generator_model="llama3-8b-8192")
]

# For comparing different models with custom system prompts:
cohorts = [
    Cohort(
        name="GPT-4 with QA Prompt",
        generator_model="gpt-4",
        system_prompt="You are a helpful assistant that answers questions accurately."
    ),
    Cohort(
        name="GPT-4 with Concise Prompt",
        generator_model="gpt-4",
        system_prompt="Provide extremely concise answers with minimal explanation."
    )
]

Example - Creating RAG pipeline cohorts:

# For comparing different RAG approaches:
cohorts = [
    Cohort(name="Basic RAG", rag_pipeline="basic-rag-pipeline"),
    Cohort(name="Query Rewriting RAG", rag_pipeline="query-rewriting-rag"),
    Cohort(name="Hypothetical Document Embeddings", rag_pipeline="hyde-rag")
]
Parameters:
  • name (str) – The name of the cohort (e.g., “GPT-4”, “RAG-v1”).

  • generator_model (str, optional) – The model identifier to use for generation.

  • rag_pipeline (str, optional) – The RAG pipeline configuration identifier.

  • system_prompt (str, optional) – Override system prompt to use with this cohort.

to_dict()[source]

Convert the Cohort instance to a dictionary for API communication.

Returns:

Dictionary containing the cohort’s configuration.

Return type:

dict

Criteria

class Criteria(name, phase='', description='', prompt='', bool_true='', bool_false='', output_type='', header='', likert_score_1='', likert_score_2='', likert_score_3='', likert_score_4='', likert_score_5='', criteria_type='llm_judge', function_name='', match_type='', match_pattern='', test_string='', validation_status='', case_sensitive=False)[source]

Bases: RagMetricsObject

Defines evaluation criteria for assessing LLM responses.

Criteria specify how to evaluate LLM responses in experiments and reviews. They can operate in two different modes:

  1. LLM-based evaluation: Uses another LLM to judge responses based on specified rubrics like Likert scales, boolean judgments, or custom prompts.

  2. Function-based evaluation: Uses programmatic rules like string matching to automatically evaluate responses.

Criteria can be applied to either the retrieval phase (evaluating context) or the generation phase (evaluating final answers).

object_type: str = 'criteria'[source]
__init__(name, phase='', description='', prompt='', bool_true='', bool_false='', output_type='', header='', likert_score_1='', likert_score_2='', likert_score_3='', likert_score_4='', likert_score_5='', criteria_type='llm_judge', function_name='', match_type='', match_pattern='', test_string='', validation_status='', case_sensitive=False)[source]

Initialize a new Criteria instance.

Example - Creating a 5-point Likert scale criteria:

import ragmetrics
from ragmetrics import Criteria

# Login
ragmetrics.login("your-api-key")

# Create a relevance criteria using a 5-point Likert scale
relevance = Criteria(
    name="Relevance",
    phase="generation",
    output_type="5-point",
    criteria_type="llm_judge",
    header="How relevant is the response to the question?",
    likert_score_1="Not relevant at all",
    likert_score_2="Slightly relevant",
    likert_score_3="Moderately relevant",
    likert_score_4="Very relevant",
    likert_score_5="Completely relevant"
)
relevance.save()

Example - Creating a boolean criteria:

# Create a factual correctness criteria using a boolean judgment
factual = Criteria(
    name="Factually Correct",
    phase="generation",
    output_type="bool",
    criteria_type="llm_judge",
    header="Is the response factually correct based on the provided context?",
    bool_true="Yes, the response is factually correct and consistent with the context",
    bool_false="No, the response contains factual errors or contradicts the context"
)
factual.save()

Example - Creating a string matching criteria (automated):

# Create an automated criteria that checks if a response contains a date
contains_date = Criteria(
    name="Contains Date",
    phase="generation",
    output_type="bool",
    criteria_type="function",
    function_name="string_match",
    match_type="regex_match",
    match_pattern=r"\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2}",
    test_string="The event occurred on 12/25/2023",
    case_sensitive=False
)
contains_date.save()

Example - Creating a custom prompt criteria:

# Create a criteria with a custom prompt for more flexible evaluation
custom_eval = Criteria(
    name="Reasoning Quality",
    phase="generation",
    output_type="prompt",
    criteria_type="llm_judge",
    description="Evaluate the quality of reasoning in the response",
    prompt=(
        "On a scale of 1-10, rate the quality of reasoning in the response.\n"
        "Consider these factors:\n"
        "* Logical flow of arguments\n"
        "* Use of evidence\n"
        "* Consideration of alternatives\n"
        "* Absence of fallacies\n"
        "\n"
        "First explain your reasoning, then provide a final score between 1-10."
    )
)
custom_eval.save()
Parameters:
  • name (str) – The criteria name (required).

  • phase (str) – Either “retrieval” or “generation” (default: “”).

  • description (str) – Description for prompt output type (default: “”).

  • prompt (str) – Prompt for prompt output type (default: “”).

  • bool_true (str) – True description for Boolean output type (default: “”).

  • bool_false (str) – False description for Boolean output type (default: “”).

  • output_type (str) – Output type, e.g., “5-point”, “bool”, or “prompt” (default: “”).

  • header (str) – Header for 5-point or Boolean output types (default: “”).

  • likert_score_1..5 (str) – Labels for a 5-point Likert scale (default: “”).

  • criteria_type (str) – Implementation type, “llm_judge” or “function” (default: “llm_judge”).

  • function_name (str) – Name of the function if criteria_type is “function” (default: “”).

  • match_type (str) – For string_match function (e.g., “starts_with”, “ends_with”, “contains”, “regex_match”) (default: “”).

  • match_pattern (str) – The pattern used for matching (default: “”).

  • test_string (str) – A sample test string (default: “”).

  • validation_status (str) – “valid” or “invalid” (default: “”).

  • case_sensitive (bool) – Whether matching is case sensitive (default: False).

to_dict()[source]

Convert the criteria object to a dictionary format for API communication.

The specific fields included in the dictionary depend on the criteria’s output_type and criteria_type.

Returns:

Dictionary representation of the criteria, including all relevant fields based on the output_type and criteria_type.

Return type:

dict

classmethod from_dict(data)[source]

Create a Criteria instance from a dictionary.

Used internally when downloading criteria from the RagMetrics API.

Parameters:

data (dict) – Dictionary containing criteria data.

Returns:

A new Criteria instance with the specified data.

Return type:

Criteria

Dataset

class Dataset(name, examples=[], source_type='', source_file='', questions_qty=0)[source]

Bases: RagMetricsObject

A collection of examples for evaluation.

Datasets are used in experiments to test models and RAG pipelines against a consistent set of questions. They provide the questions and ground truth information needed for systematic evaluation.

Datasets can be created programmatically, uploaded from files, or downloaded from the RagMetrics platform.

object_type: str = 'dataset'[source]
__init__(name, examples=[], source_type='', source_file='', questions_qty=0)[source]

Initialize a new Dataset instance.

Example - Creating and saving a dataset:

import ragmetrics
from ragmetrics import Example, Dataset

# Login to RagMetrics
ragmetrics.login("your-api-key")

# Create examples
examples = [
    Example(
        question="What is the capital of France?",
        ground_truth_context="France is a country in Western Europe. Its capital is Paris.",
        ground_truth_answer="Paris"
    ),
    Example(
        question="Who wrote Hamlet?",
        ground_truth_context="Hamlet is a tragedy written by William Shakespeare.",
        ground_truth_answer="William Shakespeare"
    )
]

# Create dataset
dataset = Dataset(name="Geography and Literature QA", examples=examples)

# Save to RagMetrics platform
dataset.save()
print(f"Dataset saved with ID: {dataset.id}")

Example - Downloading and using an existing dataset:

# Download dataset by name
dataset = Dataset.download(name="Geography and Literature QA")

# Or download by ID
# dataset = Dataset.download(id=12345)

# Iterate through examples
for example in dataset:
    print(f"Question: {example.question}")
    print(f"Answer: {example.ground_truth_answer}")

# Access example count
print(f"Dataset contains {len(dataset.examples)} examples")
Parameters:
  • name (str) – The name of the dataset.

  • examples (list) – List of Example instances (default: []).

  • source_type (str) – Type of the data source (default: “”).

  • source_file (str) – Path to the source file (default: “”).

  • questions_qty (int) – Number of questions in the dataset (default: 0).

to_dict()[source]

Convert the Dataset instance into a dictionary for API communication.

Returns:

Dictionary containing the dataset name, source, examples, and quantity.

Return type:

dict

classmethod from_dict(data)[source]

Create a Dataset instance from a dictionary.

Used internally when downloading datasets from the RagMetrics API.

Parameters:

data (dict) – Dictionary containing dataset information.

Returns:

A new Dataset instance with the specified data.

Return type:

Dataset

__iter__()[source]

Make the Dataset instance iterable over its examples.

This allows using a dataset in a for loop to iterate through examples.

Example:

dataset = Dataset.download(name="my-dataset")
for example in dataset:
    print(example.question)
Returns:

An iterator over the dataset’s examples.

Return type:

iterator

Example

class Example(question, ground_truth_context, ground_truth_answer)[source]

Bases: object

A single example in a dataset for evaluation.

Each Example represents one test case consisting of a question, the ground truth context that contains the answer, and the expected ground truth answer.

Examples are used in experiments to evaluate how well a model or RAG pipeline performs on specific questions.

__init__(question, ground_truth_context, ground_truth_answer)[source]

Initialize a new Example instance.

Example:

# Simple example with string context
example = Example(
    question="What is the capital of France?",
    ground_truth_context="France is a country in Western Europe. Its capital is Paris.",
    ground_truth_answer="Paris"
)

# Example with a list of context strings
example_multi_context = Example(
    question="Is NYC beautiful?",
    ground_truth_context=[
        "NYC is the biggest city in the east of US.",
        "NYC is on the eastern seaboard.",
        "NYC is a very beautiful city"
    ],
    ground_truth_answer="Yes"
)
Parameters:
  • question (str) – The question to be answered.

  • ground_truth_context (str or list) – The context containing the answer. Can be a string or list of strings.

  • ground_truth_answer (str) – The expected answer to the question.

to_dict()[source]

Convert the Example instance into a dictionary for API requests.

Returns:

Dictionary containing the example’s question, context, and answer.

Return type:

dict

Experiment

class Experiment(name, dataset, task, cohorts, criteria, judge_model)[source]

Bases: object

A class representing an evaluation experiment.

An Experiment orchestrates the evaluation of one or more cohorts (model configurations) against a dataset using specified criteria. It handles all the complexity of coordinating the API calls, tracking progress, and retrieving results.

Experiments are the core way to systematically evaluate and compare LLM configurations in RagMetrics.

__init__(name, dataset, task, cohorts, criteria, judge_model)[source]

Initialize a new Experiment instance.

Example - Basic experiment with existing components:

import ragmetrics
from ragmetrics import Experiment, Cohort, Dataset, Task, Criteria

# Login
ragmetrics.login("your-api-key")

# Download existing components by name
dataset = Dataset.download(name="Geography QA")
task = Task.download(name="Question Answering")

# Create cohorts to compare
cohorts = [
    Cohort(name="GPT-4", generator_model="gpt-4"),
    Cohort(name="Claude 3", generator_model="claude-3-sonnet-20240229")
]

# Use existing criteria (by name)
criteria = ["Accuracy", "Relevance", "Conciseness"]

# Create and run experiment
experiment = Experiment(
    name="Model Comparison - Geography",
    dataset=dataset,
    task=task,
    cohorts=cohorts,
    criteria=criteria,
    judge_model="gpt-4"
)

# Run the experiment and wait for results
results = experiment.run()

Example - Complete experiment creation flow:

import ragmetrics
from ragmetrics import Experiment, Cohort, Dataset, Task, Criteria, Example

# Login
ragmetrics.login("your-api-key")

# 1. Create a dataset
examples = [
    Example(
        question="What is the capital of France?",
        ground_truth_context="France is a country in Western Europe. Its capital is Paris.",
        ground_truth_answer="Paris"
    ),
    Example(
        question="What is the largest planet in our solar system?",
        ground_truth_context="Jupiter is the largest planet in our solar system.",
        ground_truth_answer="Jupiter"
    )
]
dataset = Dataset(name="General Knowledge QA", examples=examples)
dataset.save()

# 2. Create a task
task = Task(
    name="General QA Task",
    generator_model="gpt-4",
    system_prompt="You are a helpful assistant that answers questions accurately."
)
task.save()

# 3. Create criteria
relevance = Criteria(
    name="Relevance",
    phase="generation",
    output_type="5-point",
    criteria_type="llm_judge",
    header="How relevant is the response to the question?",
    likert_score_1="Not relevant at all",
    likert_score_2="Slightly relevant",
    likert_score_3="Moderately relevant",
    likert_score_4="Very relevant",
    likert_score_5="Completely relevant"
)
relevance.save()

factual = Criteria(
    name="Factual Accuracy",
    phase="generation",
    output_type="bool",
    criteria_type="llm_judge",
    header="Is the answer factually correct?",
    bool_true="Yes, the answer is factually correct.",
    bool_false="No, the answer contains factual errors."
)
factual.save()

# 4. Define cohorts
cohorts = [
    Cohort(name="GPT-4", generator_model="gpt-4"),
    Cohort(name="Claude 3", generator_model="claude-3-sonnet-20240229"),
    Cohort(name="GPT-3.5", generator_model="gpt-3.5-turbo")
]

# 5. Create experiment
experiment = Experiment(
    name="Model Comparison - General Knowledge",
    dataset=dataset,
    task=task,
    cohorts=cohorts,
    criteria=[relevance, factual],
    judge_model="gpt-4"
)

# 6. Run the experiment
results = experiment.run()
Parameters:
  • name (str) – The name of the experiment.

  • dataset (Dataset or str) – The dataset to use for evaluation.

  • task (Task or str) – The task definition to evaluate.

  • cohorts (list or str) – List of cohorts to evaluate, or JSON string.

  • criteria (list or str) – List of evaluation criteria.

  • judge_model (str) – The model to use for judging responses.

_process_dataset(dataset)[source]

Process and validate the dataset parameter.

Handles different ways of specifying a dataset (object, name, ID) and ensures it exists on the server.

Parameters:

dataset (Dataset or str) – The dataset to process.

Returns:

The ID of the processed dataset.

Return type:

str

Raises:
  • ValueError – If the dataset is invalid or missing required attributes.

  • Exception – If the dataset cannot be found on the server.

_process_task(task)[source]

Process and validate the task parameter.

Handles different ways of specifying a task (object, name, ID) and ensures it exists on the server.

Parameters:

task (Task or str) – The task to process.

Returns:

The ID of the processed task.

Return type:

str

Raises:
  • ValueError – If the task is invalid or missing required attributes.

  • Exception – If the task cannot be found on the server.

_process_cohorts()[source]

Process and validate the cohorts parameter.

Converts the cohorts parameter (list of Cohort objects or JSON string) to a JSON string for the API. Validates that each cohort is properly configured.

Returns:

JSON string containing the processed cohorts.

Return type:

str

Raises:

ValueError – If cohorts are invalid or improperly configured.

_process_criteria(criteria)[source]

Process and validate the criteria parameter.

Handles different ways of specifying criteria (objects, names, IDs) and ensures they exist on the server.

Parameters:

criteria (list or str) – The criteria to process.

Returns:

List of criteria IDs.

Return type:

list

Raises:
  • ValueError – If the criteria are invalid.

  • Exception – If criteria cannot be found on the server.

_build_payload()[source]

Build the payload for the API request.

Processes all components of the experiment and constructs the complete payload to send to the server.

Returns:

The payload to send to the server.

Return type:

dict

_call_api(payload)[source]

Make the API call to run the experiment.

Sends the experiment configuration to the server and handles the response.

Parameters:

payload (dict) – The payload to send to the API.

Returns:

The API response.

Return type:

dict

Raises:

Exception – If the API call fails.

run_async()[source]

Submit the experiment asynchronously.

Starts the experiment on the server without waiting for it to complete. Use this when you want to start an experiment and check its status later.
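
Example (a minimal sketch; assumes an Experiment instance named experiment has already been constructed as in the examples above):

# Start the experiment without blocking
future = experiment.run_async()

# Do other work here, then wait for the submission response
response = future.result()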

Returns:

A Future object that will contain the API response.

Return type:

concurrent.futures.Future

run(poll_interval=2)[source]

Run the experiment and display real-time progress.

This method submits the experiment to the server and then polls for progress updates, displaying a progress bar. It blocks until the experiment completes or fails.

Example:

# Create the experiment
experiment = Experiment(
    name="Model Comparison",
    dataset="My Dataset",
    task="QA Task",
    cohorts=cohorts,
    criteria=criteria,
    judge_model="gpt-4"
)

# Run with default polling interval (2 seconds)
results = experiment.run()

# Or run with custom polling interval
results = experiment.run(poll_interval=5)  # Check every 5 seconds
Parameters:

poll_interval (int) – Time between progress checks in seconds (default: 2).

Returns:

The experiment results once completed.

Return type:

dict

Raises:

Exception – If the experiment fails to start or encounters an error.

ReviewQueue

class ReviewQueue(name, condition='', criteria=None, judge_model=None, dataset=None)[source]

Bases: RagMetricsObject

Manages a queue of traces for manual review and evaluation.

A ReviewQueue allows for structured human evaluation of LLM interactions by collecting traces that match specific conditions and applying evaluation criteria. It supports both automated and human-in-the-loop evaluation workflows.

object_type: str = 'reviews'[source]
__init__(name, condition='', criteria=None, judge_model=None, dataset=None)[source]

Initialize a new ReviewQueue instance.
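
Example - Creating a review queue (a minimal sketch; the condition string is illustrative, and save() is assumed to be inherited from RagMetricsObject as with the other classes above):

import ragmetrics
from ragmetrics import ReviewQueue

ragmetrics.login("your-api-key")

queue = ReviewQueue(
    name="Production QA Review",
    condition="metadata.env = 'production'",
    criteria=["Accuracy", "Relevance"],
    judge_model="gpt-4"
)
queue.save()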

Parameters:
  • name (str) – The name of the review queue.

  • condition (str, optional) – SQL-like condition to filter traces (default: “”).

  • criteria (list or str, optional) – Evaluation criteria to apply.

  • judge_model (str, optional) – LLM model to use for automated evaluation.

  • dataset (Dataset or str, optional) – Dataset to use for evaluation.

__setattr__(key, value)[source]

Override attribute setting to enable edit mode when modifying an existing queue.

This automatically sets edit_mode to True when any attribute (except edit_mode itself) is changed on a queue with an existing ID.

Parameters:
  • key (str) – The attribute name.

  • value – The value to set.

property traces[source]

Get the traces associated with this review queue.

Lazily loads traces from the server if they haven’t been loaded yet.

Returns:

List of Trace objects in this review queue.

Return type:

list

_process_dataset(dataset)[source]

Process and validate the dataset parameter.

Converts various dataset representations (object, ID, name) to a dataset ID that can be used in API requests.

Parameters:

dataset (Dataset, int, str) – The dataset to process.

Returns:

The ID of the processed dataset.

Return type:

int

Raises:
  • ValueError – If the dataset is invalid or not found.

  • Exception – If the dataset cannot be found on the server.

_process_criteria(criteria)[source]

Process and validate the criteria parameter.

Converts various criteria representations (object, dict, ID, name) to a list of criteria IDs that can be used in API requests.

Parameters:

criteria (list, Criteria, str, int) – The criteria to process.

Returns:

List of criteria IDs.

Return type:

list

Raises:
  • ValueError – If the criteria are invalid.

  • Exception – If criteria cannot be found on the server.

to_dict()[source]

Convert the ReviewQueue to a dictionary for API communication.

Returns:

Dictionary representation of the review queue with all necessary fields for API communication.

Return type:

dict

classmethod from_dict(data)[source]

Create a ReviewQueue instance from a dictionary.

Parameters:

data (dict) – Dictionary containing review queue information.

Returns:

A new ReviewQueue instance with the specified data.

Return type:

ReviewQueue

__iter__()[source]

Make the ReviewQueue iterable over its traces.
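
Example (a minimal sketch; assumes download() is inherited from RagMetricsObject, as with Dataset and Task):

queue = ReviewQueue.download(name="Production QA Review")
for trace in queue:
    print(trace.input)
    print(trace.output)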

Returns:

An iterator over the review queue’s traces.

Return type:

iterator

Task

class Task(name, generator_model='', system_prompt='')[source]

Bases: RagMetricsObject

A class representing a specific task configuration for LLM evaluations.

Tasks define how models should generate responses for each example in a dataset. This includes specifying the prompt format, system message, and any other parameters needed for generation.

object_type: str = 'task'[source]
__init__(name, generator_model='', system_prompt='')[source]

Initialize a new Task instance.

Example - Creating a simple QA task:

import ragmetrics
from ragmetrics import Task

# Login
ragmetrics.login("your-api-key")

# Create a basic QA task
qa_task = Task(
    name="Question Answering",
    generator_model="gpt-4",
    system_prompt="You are a helpful assistant that answers questions accurately and concisely."
)

# Save the task for future use
qa_task.save()

Example - Creating a RAG evaluation task:

# RAG evaluation task
rag_task = Task(
    name="RAG Evaluation",
    generator_model="gpt-4",
    system_prompt="Answer the question using only the provided context. If the context doesn't contain the answer, say 'I don't know'."
)

# Save the task
rag_task.save()
Parameters:
  • name (str) – The name of the task.

  • generator_model (str, optional) – Default model for generation if not specified in cohort.

  • system_prompt (str, optional) – System prompt to use when generating responses.

to_dict()[source]

Convert the Task instance to a dictionary for API communication.

Returns:

Dictionary containing the task configuration.

Return type:

dict

classmethod from_dict(data)[source]

Create a Task instance from a dictionary.

Used internally when downloading tasks from the RagMetrics API.

Parameters:

data (dict) – Dictionary containing task information.

Returns:

A new Task instance with the specified data.

Return type:

Task

Trace

class Trace(id=None, created_at=None, input=None, output=None, raw_input=None, raw_output=None, contexts=None, metadata=None)[source]

Bases: RagMetricsObject

Represents a logged interaction between an application and an LLM.

A Trace captures the complete details of an LLM interaction, including raw inputs and outputs, processed data, metadata, and contextual information. Traces can be retrieved, modified, and saved back to the RagMetrics platform.

object_type: str = 'trace'[source]
__init__(id=None, created_at=None, input=None, output=None, raw_input=None, raw_output=None, contexts=None, metadata=None)[source]

Initialize a new Trace instance.
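
Example - Creating a trace manually (a minimal sketch; string contexts are a simplification, and save() is assumed to be inherited from RagMetricsObject):

from ragmetrics import Trace

trace = Trace(
    input="What is the capital of France?",
    output="Paris",
    contexts=["France is a country in Western Europe. Its capital is Paris."],
    metadata={"env": "dev", "model": "gpt-4"}
)
trace.save()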

Parameters:
  • id (str, optional) – Unique identifier of the trace.

  • created_at (str, optional) – Timestamp when the trace was created.

  • input (str, optional) – The processed/formatted input to the LLM.

  • output (str, optional) – The processed/formatted output from the LLM.

  • raw_input (dict, optional) – The raw input data sent to the LLM.

  • raw_output (dict, optional) – The raw output data received from the LLM.

  • contexts (list, optional) – List of context information provided during the interaction.

  • metadata (dict, optional) – Additional metadata about the interaction.

__setattr__(key, value)[source]

Override attribute setting to enable edit mode when modifying an existing trace.

This automatically sets edit_mode to True when any attribute (except edit_mode itself) is changed on a trace with an existing ID.
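
Example (a minimal sketch; assumes download() is inherited from RagMetricsObject and uses a placeholder ID):

# Download an existing trace, edit it, and save the change
trace = Trace.download(id=12345)
trace.output = "Paris is the capital of France."  # setting an attribute flips edit_mode to True
trace.save()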

Parameters:
  • key (str) – The attribute name.

  • value – The value to set.

to_dict()[source]

Convert the Trace object to a dictionary for API communication.

Returns:

A dictionary representation of the trace, with an edit_mode flag to indicate whether this is an update to an existing trace.

Return type:

dict

classmethod from_dict(data)[source]

Create a Trace instance from a dictionary.

Parameters:

data (dict) – Dictionary containing trace information.

Returns:

A new Trace instance initialized with the provided data.

Return type:

Trace