API Reference
Core Methods
- login(key=None, base_url=None, off=False)[source]
Authenticate with the RagMetrics API
- Parameters:
key – Optional API key, defaults to environment variable RAGMETRICS_API_KEY
base_url – Optional custom base URL for the API
off – Whether to disable logging
- Returns:
True if login was successful
- Return type:
bool
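Example (a minimal sketch; the key value is a placeholder):
import ragmetrics

# Authenticate with an explicit key (placeholder value shown)
ragmetrics.login(key="your-api-key")

# Or omit the key to fall back to the RAGMETRICS_API_KEY environment variable
ragmetrics.login()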
- monitor(client, metadata=None, callback=None)[source]
Wrap LLM clients to automatically log interactions
- Parameters:
client – The LLM client to monitor
metadata – Optional metadata to include with logged traces
callback – Optional callback function for custom processing
- Returns:
Wrapped client that logs interactions
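Example (a sketch assuming the standard OpenAI Python client; the metadata values are illustrative):
import ragmetrics
from openai import OpenAI

ragmetrics.login("your-api-key")

# Wrap the client so its calls are logged as traces
client = ragmetrics.monitor(OpenAI(), metadata={"app": "docs-demo", "env": "dev"})

# Use the wrapped client exactly as usual; the interaction is logged automatically
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)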
Classes
Cohort
- class Cohort(name, generator_model=None, rag_pipeline=None, system_prompt=None)[source]
Bases:
object
A class representing a group of models or pipelines to be evaluated.
A cohort defines a specific configuration to test in an experiment. It can represent either a single model or a RAG pipeline configuration. Cohorts allow comparing different setups against the same dataset and criteria.
- __init__(name, generator_model=None, rag_pipeline=None, system_prompt=None)[source]
Initialize a new Cohort instance.
Note: A cohort must include either generator_model OR rag_pipeline, not both.
Example - Creating model cohorts:
# For comparing different models:
cohorts = [
    Cohort(name="GPT-4", generator_model="gpt-4"),
    Cohort(name="Claude 3 Sonnet", generator_model="claude-3-sonnet-20240229"),
    Cohort(name="Llama 3", generator_model="llama3-8b-8192")
]

# For comparing different models with custom system prompts:
cohorts = [
    Cohort(
        name="GPT-4 with QA Prompt",
        generator_model="gpt-4",
        system_prompt="You are a helpful assistant that answers questions accurately."
    ),
    Cohort(
        name="GPT-4 with Concise Prompt",
        generator_model="gpt-4",
        system_prompt="Provide extremely concise answers with minimal explanation."
    )
]
Example - Creating RAG pipeline cohorts:
# For comparing different RAG approaches:
cohorts = [
    Cohort(name="Basic RAG", rag_pipeline="basic-rag-pipeline"),
    Cohort(name="Query Rewriting RAG", rag_pipeline="query-rewriting-rag"),
    Cohort(name="Hypothetical Document Embeddings", rag_pipeline="hyde-rag")
]
- Parameters:
name (str) – The name of the cohort (e.g., “GPT-4”, “RAG-v1”).
generator_model (str, optional) – The model identifier to use for generation.
rag_pipeline (str, optional) – The RAG pipeline configuration identifier.
system_prompt (str, optional) – Override system prompt to use with this cohort.
- to_dict()[source]
Convert the Cohort instance to a dictionary for API communication.
- Returns:
Dictionary containing the cohort’s configuration.
- Return type:
dict
Criteria
- class Criteria(name, phase='', description='', prompt='', bool_true='', bool_false='', output_type='', header='', likert_score_1='', likert_score_2='', likert_score_3='', likert_score_4='', likert_score_5='', criteria_type='llm_judge', function_name='', match_type='', match_pattern='', test_string='', validation_status='', case_sensitive=False)[source]
Bases:
RagMetricsObject
Defines evaluation criteria for assessing LLM responses.
Criteria specify how to evaluate LLM responses in experiments and reviews. They can operate in two different modes:
- LLM-based evaluation: Uses another LLM to judge responses based on specified rubrics like Likert scales, boolean judgments, or custom prompts.
- Function-based evaluation: Uses programmatic rules like string matching to automatically evaluate responses.
Criteria can be applied to either the retrieval phase (evaluating context) or the generation phase (evaluating final answers).
- object_type: str = 'criteria'[source]
- __init__(name, phase='', description='', prompt='', bool_true='', bool_false='', output_type='', header='', likert_score_1='', likert_score_2='', likert_score_3='', likert_score_4='', likert_score_5='', criteria_type='llm_judge', function_name='', match_type='', match_pattern='', test_string='', validation_status='', case_sensitive=False)[source]
Initialize a new Criteria instance.
Example - Creating a 5-point Likert scale criteria:
import ragmetrics
from ragmetrics import Criteria

# Login
ragmetrics.login("your-api-key")

# Create a relevance criteria using a 5-point Likert scale
relevance = Criteria(
    name="Relevance",
    phase="generation",
    output_type="5-point",
    criteria_type="llm_judge",
    header="How relevant is the response to the question?",
    likert_score_1="Not relevant at all",
    likert_score_2="Slightly relevant",
    likert_score_3="Moderately relevant",
    likert_score_4="Very relevant",
    likert_score_5="Completely relevant"
)
relevance.save()
Example - Creating a boolean criteria:
# Create a factual correctness criteria using a boolean judgment
factual = Criteria(
    name="Factually Correct",
    phase="generation",
    output_type="bool",
    criteria_type="llm_judge",
    header="Is the response factually correct based on the provided context?",
    bool_true="Yes, the response is factually correct and consistent with the context",
    bool_false="No, the response contains factual errors or contradicts the context"
)
factual.save()
Example - Creating a string matching criteria (automated):
# Create an automated criteria that checks if a response contains a date
contains_date = Criteria(
    name="Contains Date",
    phase="generation",
    output_type="bool",
    criteria_type="function",
    function_name="string_match",
    match_type="regex_match",
    match_pattern=r"\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2}",
    test_string="The event occurred on 12/25/2023",
    case_sensitive=False
)
contains_date.save()
Example - Creating a custom prompt criteria:
# Create a criteria with a custom prompt for more flexible evaluation
custom_eval = Criteria(
    name="Reasoning Quality",
    phase="generation",
    output_type="prompt",
    criteria_type="llm_judge",
    description="Evaluate the quality of reasoning in the response",
    prompt=(
        "On a scale of 1-10, rate the quality of reasoning in the response.\n"
        "Consider these factors:\n"
        "* Logical flow of arguments\n"
        "* Use of evidence\n"
        "* Consideration of alternatives\n"
        "* Absence of fallacies\n"
        "First explain your reasoning, then provide a final score between 1-10."
    )
)
custom_eval.save()
- Parameters:
name (str) – The criteria name (required).
phase (str) – Either “retrieval” or “generation” (default: “”).
description (str) – Description for prompt output type (default: “”).
prompt (str) – Prompt for prompt output type (default: “”).
bool_true (str) – True description for Boolean output type (default: “”).
bool_false (str) – False description for Boolean output type (default: “”).
output_type (str) – Output type, e.g., “5-point”, “bool”, or “prompt” (default: “”).
header (str) – Header for 5-point or Boolean output types (default: “”).
likert_score_1..5 (str) – Labels for a 5-point Likert scale (default: “”).
criteria_type (str) – Implementation type, “llm_judge” or “function” (default: “llm_judge”).
function_name (str) – Name of the function if criteria_type is “function” (default: “”).
match_type (str) – For the string_match function (e.g., “starts_with”, “ends_with”, “contains”, “regex_match”) (default: “”).
match_pattern (str) – The pattern used for matching (default: “”).
test_string (str) – A sample test string (default: “”).
validation_status (str) – “valid” or “invalid” (default: “”).
case_sensitive (bool) – Whether matching is case sensitive (default: False).
- to_dict()[source]
Convert the criteria object to a dictionary format for API communication.
The specific fields included in the dictionary depend on the criteria’s output_type and criteria_type.
- Returns:
Dictionary representation of the criteria, including all relevant fields based on the output_type and criteria_type.
- Return type:
dict
- classmethod from_dict(data)[source]
Create a Criteria instance from a dictionary.
Used internally when downloading criteria from the RagMetrics API.
- Parameters:
data (dict) – Dictionary containing criteria data.
- Returns:
A new Criteria instance with the specified data.
- Return type:
Criteria
Dataset
- class Dataset(name, examples=[], source_type='', source_file='', questions_qty=0)[source]
Bases:
RagMetricsObject
A collection of examples for evaluation.
Datasets are used in experiments to test models and RAG pipelines against a consistent set of questions. They provide the questions and ground truth information needed for systematic evaluation.
Datasets can be created programmatically, uploaded from files, or downloaded from the RagMetrics platform.
- object_type: str = 'dataset'[source]
- __init__(name, examples=[], source_type='', source_file='', questions_qty=0)[source]
Initialize a new Dataset instance.
Example - Creating and saving a dataset:
import ragmetrics
from ragmetrics import Example, Dataset

# Login to RagMetrics
ragmetrics.login("your-api-key")

# Create examples
examples = [
    Example(
        question="What is the capital of France?",
        ground_truth_context="France is a country in Western Europe. Its capital is Paris.",
        ground_truth_answer="Paris"
    ),
    Example(
        question="Who wrote Hamlet?",
        ground_truth_context="Hamlet is a tragedy written by William Shakespeare.",
        ground_truth_answer="William Shakespeare"
    )
]

# Create dataset
dataset = Dataset(name="Geography and Literature QA", examples=examples)

# Save to RagMetrics platform
dataset.save()
print(f"Dataset saved with ID: {dataset.id}")
Example - Downloading and using an existing dataset:
# Download dataset by name
dataset = Dataset.download(name="Geography and Literature QA")

# Or download by ID
# dataset = Dataset.download(id=12345)

# Iterate through examples
for example in dataset:
    print(f"Question: {example.question}")
    print(f"Answer: {example.ground_truth_answer}")

# Access example count
print(f"Dataset contains {len(dataset.examples)} examples")
- Parameters:
name (str) – The name of the dataset.
examples (list) – List of Example instances (default: []).
source_type (str) – Type of the data source (default: “”).
source_file (str) – Path to the source file (default: “”).
questions_qty (int) – Number of questions in the dataset (default: 0).
- to_dict()[source]
Convert the Dataset instance into a dictionary for API communication.
- Returns:
Dictionary containing the dataset name, source, examples, and quantity.
- Return type:
dict
- classmethod from_dict(data)[source]
Create a Dataset instance from a dictionary.
Used internally when downloading datasets from the RagMetrics API.
- Parameters:
data (dict) – Dictionary containing dataset information.
- Returns:
A new Dataset instance with the specified data.
- Return type:
Dataset
- __iter__()[source]
Make the Dataset instance iterable over its examples.
This allows using a dataset in a for loop to iterate through examples.
Example:
dataset = Dataset.download(name="my-dataset")
for example in dataset:
    print(example.question)
- Returns:
An iterator over the dataset’s examples.
- Return type:
iterator
Example
- class Example(question, ground_truth_context, ground_truth_answer)[source]
Bases:
object
A single example in a dataset for evaluation.
Each Example represents one test case consisting of a question, the ground truth context that contains the answer, and the expected ground truth answer.
Examples are used in experiments to evaluate how well a model or RAG pipeline performs on specific questions.
- __init__(question, ground_truth_context, ground_truth_answer)[source]
Initialize a new Example instance.
Example:
# Simple example with string context
example = Example(
    question="What is the capital of France?",
    ground_truth_context="France is a country in Western Europe. Its capital is Paris.",
    ground_truth_answer="Paris"
)

# Example with a list of context strings
example_multi_context = Example(
    question="Is NYC beautiful?",
    ground_truth_context=[
        "NYC is the biggest city in the east of US.",
        "NYC is on the eastern seaboard.",
        "NYC is a very beautiful city"
    ],
    ground_truth_answer="Yes"
)
- Parameters:
question (str) – The question to be answered.
ground_truth_context (str or list) – The context containing the answer. Can be a string or list of strings.
ground_truth_answer (str) – The expected answer to the question.
- to_dict()[source]
Convert the Example instance into a dictionary for API requests.
- Returns:
Dictionary containing the example’s question, context, and answer.
- Return type:
dict
Experiment
- class Experiment(name, dataset, task, cohorts, criteria, judge_model)[source]
Bases:
object
A class representing an evaluation experiment.
An Experiment orchestrates the evaluation of one or more cohorts (model configurations) against a dataset using specified criteria. It handles all the complexity of coordinating the API calls, tracking progress, and retrieving results.
Experiments are the core way to systematically evaluate and compare LLM configurations in RagMetrics.
- __init__(name, dataset, task, cohorts, criteria, judge_model)[source]
Initialize a new Experiment instance.
Example - Basic experiment with existing components:
import ragmetrics
from ragmetrics import Experiment, Cohort, Dataset, Task, Criteria

# Login
ragmetrics.login("your-api-key")

# Download existing components by name
dataset = Dataset.download(name="Geography QA")
task = Task.download(name="Question Answering")

# Create cohorts to compare
cohorts = [
    Cohort(name="GPT-4", generator_model="gpt-4"),
    Cohort(name="Claude 3", generator_model="claude-3-sonnet-20240229")
]

# Use existing criteria (by name)
criteria = ["Accuracy", "Relevance", "Conciseness"]

# Create and run experiment
experiment = Experiment(
    name="Model Comparison - Geography",
    dataset=dataset,
    task=task,
    cohorts=cohorts,
    criteria=criteria,
    judge_model="gpt-4"
)

# Run the experiment and wait for results
results = experiment.run()
Example - Complete experiment creation flow:
import ragmetrics
from ragmetrics import Experiment, Cohort, Dataset, Task, Criteria, Example

# Login
ragmetrics.login("your-api-key")

# 1. Create a dataset
examples = [
    Example(
        question="What is the capital of France?",
        ground_truth_context="France is a country in Western Europe. Its capital is Paris.",
        ground_truth_answer="Paris"
    ),
    Example(
        question="What is the largest planet in our solar system?",
        ground_truth_context="Jupiter is the largest planet in our solar system.",
        ground_truth_answer="Jupiter"
    )
]
dataset = Dataset(name="General Knowledge QA", examples=examples)
dataset.save()

# 2. Create a task
task = Task(
    name="General QA Task",
    generator_model="gpt-4",
    system_prompt="You are a helpful assistant that answers questions accurately."
)
task.save()

# 3. Create criteria
relevance = Criteria(
    name="Relevance",
    phase="generation",
    output_type="5-point",
    criteria_type="llm_judge",
    header="How relevant is the response to the question?",
    likert_score_1="Not relevant at all",
    likert_score_2="Slightly relevant",
    likert_score_3="Moderately relevant",
    likert_score_4="Very relevant",
    likert_score_5="Completely relevant"
)
relevance.save()

factual = Criteria(
    name="Factual Accuracy",
    phase="generation",
    output_type="bool",
    criteria_type="llm_judge",
    header="Is the answer factually correct?",
    bool_true="Yes, the answer is factually correct.",
    bool_false="No, the answer contains factual errors."
)
factual.save()

# 4. Define cohorts
cohorts = [
    Cohort(name="GPT-4", generator_model="gpt-4"),
    Cohort(name="Claude 3", generator_model="claude-3-sonnet-20240229"),
    Cohort(name="GPT-3.5", generator_model="gpt-3.5-turbo")
]

# 5. Create experiment
experiment = Experiment(
    name="Model Comparison - General Knowledge",
    dataset=dataset,
    task=task,
    cohorts=cohorts,
    criteria=[relevance, factual],
    judge_model="gpt-4"
)

# 6. Run the experiment
results = experiment.run()
- Parameters:
name (str) – The name of the experiment.
dataset (Dataset or str) – The dataset to use for evaluation.
task (Task or str) – The task definition to evaluate.
cohorts (list or str) – List of cohorts to evaluate, or JSON string.
criteria (list or str) – List of evaluation criteria.
judge_model (str) – The model to use for judging responses.
- _process_dataset(dataset)[source]
Process and validate the dataset parameter.
Handles different ways of specifying a dataset (object, name, ID) and ensures it exists on the server.
- Parameters:
dataset (Dataset or str) – The dataset to process.
- Returns:
The ID of the processed dataset.
- Return type:
str
- Raises:
ValueError – If the dataset is invalid or missing required attributes.
Exception – If the dataset cannot be found on the server.
- _process_task(task)[source]
Process and validate the task parameter.
Handles different ways of specifying a task (object, name, ID) and ensures it exists on the server.
- Parameters:
task (Task or str) – The task to process.
- Returns:
The ID of the processed task.
- Return type:
str
- Raises:
ValueError – If the task is invalid or missing required attributes.
Exception – If the task cannot be found on the server.
- _process_cohorts()[source]
Process and validate the cohorts parameter.
Converts the cohorts parameter (list of Cohort objects or JSON string) to a JSON string for the API. Validates that each cohort is properly configured.
- Returns:
JSON string containing the processed cohorts.
- Return type:
str
- Raises:
ValueError – If cohorts are invalid or improperly configured.
- _process_criteria(criteria)[source]
Process and validate the criteria parameter.
Handles different ways of specifying criteria (objects, names, IDs) and ensures they exist on the server.
- Parameters:
criteria (list or str) – The criteria to process.
- Returns:
List of criteria IDs.
- Return type:
list
- Raises:
ValueError – If the criteria are invalid.
Exception – If criteria cannot be found on the server.
- _build_payload()[source]
Build the payload for the API request.
Processes all components of the experiment and constructs the complete payload to send to the server.
- Returns:
The payload to send to the server.
- Return type:
dict
- _call_api(payload)[source]
Make the API call to run the experiment.
Sends the experiment configuration to the server and handles the response.
- Parameters:
payload (dict) – The payload to send to the API.
- Returns:
The API response.
- Return type:
dict
- Raises:
Exception – If the API call fails.
- run_async()[source]
Submit the experiment asynchronously.
Starts the experiment on the server without waiting for it to complete. Use this when you want to start an experiment and check its status later.
- Returns:
A Future object that will contain the API response.
- Return type:
concurrent.futures.Future
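Example (a minimal sketch continuing the experiment examples above):
# Submit the experiment without blocking
future = experiment.run_async()

# ... do other work ...

# Retrieve the API response when needed (blocks until available)
response = future.result()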
- run(poll_interval=2)[source]
Run the experiment and display real-time progress.
This method submits the experiment to the server and then polls for progress updates, displaying a progress bar. It blocks until the experiment completes or fails.
Example:
# Create the experiment
experiment = Experiment(
    name="Model Comparison",
    dataset="My Dataset",
    task="QA Task",
    cohorts=cohorts,
    criteria=criteria,
    judge_model="gpt-4"
)

# Run with default polling interval (2 seconds)
results = experiment.run()

# Or run with custom polling interval
results = experiment.run(poll_interval=5)  # Check every 5 seconds
- Parameters:
poll_interval (int) – Time between progress checks in seconds (default: 2).
- Returns:
The experiment results once completed.
- Return type:
dict
- Raises:
Exception – If the experiment fails to start or encounters an error.
ReviewQueue
- class ReviewQueue(name, condition='', criteria=None, judge_model=None, dataset=None)[source]
Bases:
RagMetricsObject
Manages a queue of traces for manual review and evaluation.
A ReviewQueue allows for structured human evaluation of LLM interactions by collecting traces that match specific conditions and applying evaluation criteria. It supports both automated and human-in-the-loop evaluation workflows.
- object_type: str = 'reviews'[source]
- __init__(name, condition='', criteria=None, judge_model=None, dataset=None)[source]
Initialize a new ReviewQueue instance.
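Example - Creating a review queue (a sketch; the condition, criteria names, and judge model are illustrative, and save() is assumed to behave as on the other RagMetricsObject subclasses shown above):
import ragmetrics
from ragmetrics import ReviewQueue

ragmetrics.login("your-api-key")

# Collect matching traces for review; the filter condition is an illustrative example
rq = ReviewQueue(
    name="Production QA Review",
    condition="metadata.environment = 'production'",
    criteria=["Relevance", "Factual Accuracy"],
    judge_model="gpt-4"
)
rq.save()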
- Parameters:
name (str) – The name of the review queue.
condition (str, optional) – SQL-like condition to filter traces (default: “”).
criteria (list or str, optional) – Evaluation criteria to apply.
judge_model (str, optional) – LLM model to use for automated evaluation.
dataset (Dataset or str, optional) – Dataset to use for evaluation.
- __setattr__(key, value)[source]
Override attribute setting to enable edit mode when modifying an existing queue.
This automatically sets edit_mode to True when any attribute (except edit_mode itself) is changed on a queue with an existing ID.
- Parameters:
key (str) – The attribute name.
value – The value to set.
- property traces[source]
Get the traces associated with this review queue.
Lazily loads traces from the server if they haven’t been loaded yet.
- Returns:
List of Trace objects in this review queue.
- Return type:
list
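Example (a sketch; ReviewQueue.download is assumed to mirror the download method shown for Dataset and Task):
# Assumption: ReviewQueue.download works like Dataset.download / Task.download
rq = ReviewQueue.download(name="Production QA Review")

# First access lazily loads the traces from the server
for trace in rq.traces:
    print(f"Input: {trace.input}")
    print(f"Output: {trace.output}")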
- _process_dataset(dataset)[source]
Process and validate the dataset parameter.
Converts various dataset representations (object, ID, name) to a dataset ID that can be used in API requests.
- Parameters:
dataset (Dataset, int, or str) – The dataset to process.
- Returns:
The ID of the processed dataset.
- Return type:
int
- Raises:
ValueError – If the dataset is invalid or not found.
Exception – If the dataset cannot be found on the server.
- _process_criteria(criteria)[source]
Process and validate the criteria parameter.
Converts various criteria representations (object, dict, ID, name) to a list of criteria IDs that can be used in API requests.
- Parameters:
criteria (list, Criteria, str, or int) – The criteria to process.
- Returns:
List of criteria IDs.
- Return type:
list
- Raises:
ValueError – If the criteria are invalid.
Exception – If criteria cannot be found on the server.
- to_dict()[source]
Convert the ReviewQueue to a dictionary for API communication.
- Returns:
Dictionary representation of the review queue with all necessary fields for API communication.
- Return type:
dict
- classmethod from_dict(data)[source]
Create a ReviewQueue instance from a dictionary.
- Parameters:
data (dict) – Dictionary containing review queue information.
- Returns:
A new ReviewQueue instance with the specified data.
- Return type:
ReviewQueue
- __iter__()[source]
Make the ReviewQueue iterable over its traces.
- Returns:
An iterator over the review queue’s traces.
- Return type:
iterator
Task
- class Task(name, generator_model='', system_prompt='')[source]
Bases:
RagMetricsObject
A class representing a specific task configuration for LLM evaluations.
Tasks define how models should generate responses for each example in a dataset. This includes specifying the prompt format, system message, and any other parameters needed for generation.
- object_type: str = 'task'[source]
- __init__(name, generator_model='', system_prompt='')[source]
Initialize a new Task instance.
Example - Creating a simple QA task:
import ragmetrics
from ragmetrics import Task

# Login
ragmetrics.login("your-api-key")

# Create a basic QA task
qa_task = Task(
    name="Question Answering",
    generator_model="gpt-4",
    system_prompt="You are a helpful assistant that answers questions accurately and concisely."
)

# Save the task for future use
qa_task.save()
Example - Creating a RAG evaluation task:
# RAG evaluation task
rag_task = Task(
    name="RAG Evaluation",
    generator_model="gpt-4",
    system_prompt="Answer the question using only the provided context. If the context doesn't contain the answer, say 'I don't know'."
)

# Save the task
rag_task.save()
- Parameters:
name (str) – The name of the task.
generator_model (str, optional) – Default model for generation if not specified in the cohort.
system_prompt (str, optional) – System prompt to use when generating responses.
- to_dict()[source]
Convert the Task instance to a dictionary for API communication.
- Returns:
Dictionary containing the task configuration.
- Return type:
dict
- classmethod from_dict(data)[source]
Create a Task instance from a dictionary.
Used internally when downloading tasks from the RagMetrics API.
- Parameters:
data (dict) – Dictionary containing task information.
- Returns:
A new Task instance with the specified data.
- Return type:
Task
Trace
- class Trace(id=None, created_at=None, input=None, output=None, raw_input=None, raw_output=None, contexts=None, metadata=None)[source]
Bases:
RagMetricsObject
Represents a logged interaction between an application and an LLM.
A Trace captures the complete details of an LLM interaction, including raw inputs and outputs, processed data, metadata, and contextual information. Traces can be retrieved, modified, and saved back to the RagMetrics platform.
- object_type: str = 'trace'[source]
- __init__(id=None, created_at=None, input=None, output=None, raw_input=None, raw_output=None, contexts=None, metadata=None)[source]
Initialize a new Trace instance.
- Parameters:
id (str, optional) – Unique identifier of the trace.
created_at (str, optional) – Timestamp when the trace was created.
input (str, optional) – The processed/formatted input to the LLM.
output (str, optional) – The processed/formatted output from the LLM.
raw_input (dict, optional) – The raw input data sent to the LLM.
raw_output (dict, optional) – The raw output data received from the LLM.
contexts (list, optional) – List of context information provided during the interaction.
metadata (dict, optional) – Additional metadata about the interaction.
- __setattr__(key, value)[source]
Override attribute setting to enable edit mode when modifying an existing trace.
This automatically sets edit_mode to True when any attribute (except edit_mode itself) is changed on a trace with an existing ID.
- Parameters:
key (str) – The attribute name.
value – The value to set.
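Example (a sketch continuing the review queue example above; save() is assumed to persist the update, as described for traces being saved back to the platform, and the metadata key is illustrative):
# Changing an attribute on a trace that already has an ID flips edit_mode to True
trace = rq.traces[0]
trace.metadata = {**(trace.metadata or {}), "reviewed": True}  # illustrative metadata key

# save() then sends the modified trace back to the platform as an update
trace.save()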
- to_dict()[source]
Convert the Trace object to a dictionary for API communication.
- Returns:
A dictionary representation of the trace, with an edit_mode flag to indicate whether this is an update to an existing trace.
- Return type:
dict
- classmethod from_dict(data)[source]
Create a Trace instance from a dictionary.
- Parameters:
data (dict) – Dictionary containing trace information.
- Returns:
A new Trace instance initialized with the provided data.
- Return type:
Trace