Using the API

RagMetrics offers two APIs:

  1. Pull API: RagMetrics will pull data from your REST endpoint and evaluate the results. Experiment runs are triggered from the RagMetrics GUI.

  2. Push API: Your local Python code will push data to RagMetrics. Experiment runs are triggered from your Python code.


Pull API

With the Pull API, RagMetrics connects to your code through a REST endpoint, sends input as if from a user, waits for your code to process it, collects the generated output and retrieved contexts, and evaluates the results.

Developer Quickstart Video

Click to watch a video demo


Step 1: Understand the Data Format

  • The RagMetrics API uses JSON for both input and output (see the example request and response after this list).

  • Labeled Data: Your labeled data set should include:

    • Sample inputs or user questions that you would expect for your conversational AI bot.

    • The correct answers that the system should deliver.

    • Correct contexts or sources that are relevant to the answers.

  • The API will take this data, one input at a time, and feed it into your pipeline.

  • Input JSON: The input to your pipeline via the API should include:

    • The question from your labeled data set.

    • Any A/B test parameters that you want to use to switch between different parts of your code, like retrieval strategies. These are optional.

  • Output JSON: Your pipeline should return a JSON object that includes:

    • The generated answer from your LLM. This is the only required field.

    • The retrieved contexts (optional): a list of contexts, each with metadata and content.

    • The LLM response (optional).
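
For concreteness, here is a minimal sketch of the two JSON bodies described above; the field names match the sample pipeline later on this page, and the values are illustrative.

Example input (sent by RagMetrics to your endpoint):

{
    "experiment": {
        "experiment_name": "experiment 1",
        "retrieval_strategy": "embeddings"
    },
    "row": {
        "q_num": 1,
        "question": "What is your international shipping policy?"
    }
}

Example output (returned by your endpoint; only generated_answer is required):

{
    "generated_answer": "Standard international delivery takes 10-15 business days.",
    "contexts": [
        {
            "metadata": {"source": "knowledge_base", "id": 4},
            "page_content": "International shipping is available for most countries..."
        }
    ],
    "llm_response": null
}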


Step 2: Set Up Your Development Environment

  • You can use any environment you like, but this demo uses Replit, a free online coding platform that allows you to set up a web server.

  • Knowledge Base: Create a list of strings that represent your knowledge base. In this demo, the knowledge base is a list of customer service policies copied from an Excel spreadsheet.

  • RAG Method: Create a function (e.g., rag in the demo) that takes a JSON input and returns a JSON output, as defined in Step 1; a minimal skeleton appears after this list, and the full working example is at the end of this page. This method should implement the logic of your LLM pipeline.

  • Parsing Input: In your rag method, parse the input JSON to get:

    • The question (e.g., from the “row” element).

    • Any A/B test parameters (e.g., the “retrieval strategy”).

  • Retrieval Logic: Implement your retrieval logic. In the demo, the retrieval strategy is switched between ‘random’ (retrieving a random context from the knowledge base) and ‘embeddings’ (retrieving a context based on embeddings).

  • Response: Format the output as a JSON object with at least the generated_answer field. You can also add the optional contexts and llm_response fields.

  • Create an Endpoint: Use Replit or a similar platform to set up an endpoint for your LLM pipeline.
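
As a reference, here is a minimal skeleton of such a rag method. It mirrors the structure of the full example at the end of this page; the retrieval and generation steps are placeholders for your own logic.

def rag(json_input):
    # Parse the input JSON defined in Step 1
    question = json_input['row']['question']
    strategy = json_input.get('experiment', {}).get('retrieval_strategy', 'random')

    # Retrieve contexts using the selected strategy
    # (placeholder logic; replace with your own retrieval code)
    if strategy == 'random':
        contexts = ["<a randomly chosen context>"]
    else:
        contexts = ["<the context most similar to the question>"]

    # Generate an answer from the retrieved contexts
    # (placeholder; replace with a call to your LLM)
    answer = f"<answer to: {question}>"

    # Return the output JSON defined in Step 1 (generated_answer is required)
    return {
        "generated_answer": answer,
        "contexts": [{"metadata": {}, "page_content": c} for c in contexts],
    }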


Step 3: Connect Your Endpoint to RagMetrics

  • In the RagMetrics platform, go to the API demo section and create a new “rag endpoint”.

  • Give your endpoint a name (e.g., “my LLM pipeline”) and paste the URL of the endpoint you created in Step 2.

  • Test your endpoint to ensure it’s working correctly by sending a sample input and verifying the output (for example, with the snippet below).
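
One way to send that sample input yourself is a short Python script (a curl command works equally well). The payload reuses the format from Step 1; the URL is a placeholder for the endpoint you created in Step 2.

import requests

sample_input = {
    "experiment": {"experiment_name": "smoke test", "retrieval_strategy": "random"},
    "row": {"q_num": 1, "question": "What is your return policy?"}
}

# Replace with the URL of your own endpoint from Step 2
resp = requests.post("https://your-repl-name.replit.app/", json=sample_input)
print(resp.status_code)
print(resp.json())   # should contain at least "generated_answer"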


Step 4: Run an Experiment

  • Go to the experiments page and choose an existing experiment or create a new one.

  • Select the model pipeline you just created.

  • A/B Test Parameters: Create experimental cohorts to test different configurations of your pipeline.

    • In the demo, two cohorts are created: one using a ‘random’ retrieval strategy and one using an ‘embeddings’ retrieval strategy. These will be used as inputs to the API and then parsed by your code.

  • Evaluation Criteria: Select your evaluation criteria. In the demo, the focus is on “context relevance,” which measures how well the retrieved context matches the ground truth context in your labeled data.

  • Run the experiment. RagMetrics will send each question in your labeled data set to your endpoint with the A/B parameters, and then evaluate the output according to the selected criteria.


Step 5: Analyze the Results

  • After the experiment completes, you will see the results for each experimental cohort.

  • The results will show how well the system performed according to your evaluation criteria.

    • In the demo, the “embeddings” retrieval strategy scored higher than the “random” retrieval strategy on context relevance.

    • For example, when the user asks about a coffee maker, random retrieval might bring back the shipping policy, while embeddings retrieval brings back the correct policy.


Key Takeaways

  • The RagMetrics API allows you to test your LLM application, including the retrieval step, by sending JSON inputs to your own endpoint and evaluating the output.

  • You can use A/B testing parameters to switch between different parts of your code, such as different retrieval strategies.

  • RagMetrics provides an automated way to evaluate the performance of your application based on custom criteria.

  • You can use this to make data-driven decisions about optimizing your entire LLM pipeline.


Example LLM Pipeline Code Snippet (Python)

This code illustrates a minimal RAG pipeline that takes a question, retrieves a context from a knowledge base, and generates an answer. Retrieval is either random or based on embedding similarity, depending on a retrieval-strategy parameter in the input.

You can access this notebook on Colab. However, Colab does not allow incoming HTTP requests, so the demo is deployed on Replit.

from flask import Flask, request, jsonify
from openai import OpenAI
import numpy as np
import random, json

# Initialize the OpenAI client
client = OpenAI(api_key='<Your OpenAI Key>')

# Our knowledge base: a list of sentences
knowledge_base = [
    "Our standard shipping policy states that delivery times are typically within 5-7 business days, but may be extended during periods of high demand. - Company Shipping Policy Document",
    "Our return policy ensures that customers can return incorrect items at no cost, and we will promptly send the correct item. - Company Return Policy Document",
    "All appliances purchased from our store come with a 12-month warranty covering manufacturing defects and malfunctions. - Company Warranty Policy Document",
    "Returns without a receipt can be processed for store credit or full refund if the purchase is linked to a customer account. - Company Return Policy Document",
    "International shipping is available for most countries, with standard delivery times ranging from 10-15 business days, subject to customs clearance. - Company International Shipping Policy Document",
]

def get_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=text
    )
    return response.data[0].embedding


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def retrieval_random(query, knowledge_base):
    return [random.choice(knowledge_base)]


def retrieval_embed(query, knowledge_base):
    query_embedding = get_embedding(query)

    similarities = []
    for sentence in knowledge_base:
        sentence_embedding = get_embedding(sentence)
        similarity = cosine_similarity(query_embedding, sentence_embedding)
        similarities.append(similarity)

    most_relevant_index = np.argmax(similarities)
    relevant_sentences = [knowledge_base[most_relevant_index]]

    return relevant_sentences


def generation(context, query):
    prompt = f"""You provide customer service for an online retailer. Answer the customer question based on the context only.

    Question: {query}

    Context: {' '.join(context)}

    Answer:"""

    llm_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": prompt}
        ],
        temperature=0.7
    )

    answer = llm_response.choices[0].message.content

    return answer, llm_response


def simple_rag(json_input):
    # Parse input parameters
    try:
        user_question = json_input['row']['question']
    except (KeyError, TypeError):
        return {"error": "Missing json_input['row']['question']"}
    retrieval_strategy = json_input.get('experiment', {}).get('retrieval_strategy', 'random')

    # Run pipeline
    if retrieval_strategy == 'random':
        contexts = retrieval_random(user_question, knowledge_base)
    elif retrieval_strategy == 'embeddings':
        contexts = retrieval_embed(user_question, knowledge_base)
    else:
        return {"error": "Invalid json_input['experiment']['retrieval_strategy']"}

    answer, llm_response = generation(contexts, user_question)

    # Write API response
    output_json = {
        "generated_answer": answer,
        "contexts": [
            {
                "metadata": {
                    "source": "knowledge_base",
                    "id": knowledge_base.index(context)
                },
                "page_content": context
            } for context in contexts
        ],
        "llm_response": llm_response.model_dump() if llm_response else None
    }
    return output_json


# Flask app setup
app = Flask(__name__)


@app.after_request
def after_request(response):
    origin = request.headers.get('Origin')
    if origin:
        response.headers['Access-Control-Allow-Origin'] = origin
    response.headers['Access-Control-Allow-Methods'] = 'GET, POST, OPTIONS'
    response.headers['Access-Control-Allow-Headers'] = 'Content-Type, X-CSRFToken'
    response.headers['Access-Control-Allow-Credentials'] = 'true'
    return response


@app.route('/', methods=['OPTIONS'])
def options():
    response = jsonify({})
    response.headers['Access-Control-Max-Age'] = '86400'
    return response


@app.route('/', methods=['POST'])
def post_handler():
    post_data = request.get_json()
    response_data = simple_rag(post_data)
    return jsonify(response_data)


if __name__ == '__main__':
    # To test simple_rag locally without the web server, uncomment these lines:
    # sample_input = {
    #     "experiment": {
    #         "experiment_name": "experiment 1",
    #         "retrieval_strategy": "random"
    #     },
    #     "row": {
    #         "q_num": 1,
    #         "question": "international"
    #     }
    # }
    # result = simple_rag(sample_input)
    # print(json.dumps(result, indent=4))

    app.run(host='0.0.0.0', port=8000)

Next Steps

  • Try implementing different retrieval strategies in your code.

  • Experiment with different evaluation criteria to measure other aspects of your LLM application.

  • Explore the full range of RagMetrics features for more advanced evaluation and optimization.

  • Drop RagMetrics a line to try out the feature.

This quickstart should give you a clear understanding of how to use the RagMetrics API to improve your LLM applications.



Push API

This API allows you to use your local Python code to push data to RagMetrics, trigger experiments, upload datasets, create criteria and more.

Getting Started: Log a trace

Step 1: Get a RagMetrics key

  • Log into RagMetrics

  • Navigate to the Keys page

  • Create a new RagMetrics API key

  • Copy the key to your clipboard

Keys Page


Step 2: Log a trace

Use Python code such as the following to log a trace of your LLM input and output to RagMetrics:

#!pip install ragmetrics-client openai

import ragmetrics
from openai import OpenAI
ragmetrics.login(key="[your_ragmetrics_key]")

openai_client = OpenAI(api_key="[your_openai_key]")
ragmetrics.monitor(openai_client)
resp = openai_client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "1+1=?"}]
)
print(resp)

Notes:

  • Paste your RagMetrics API key into the key parameter of the ragmetrics.login() method or set os.environ["RAGMETRICS_API_KEY"] before calling login().

  • Call monitor() once and then subsequent LLM calls from this client will be logged.

  • The code above demonstrates how to monitor the OpenAI client, but we also support LiteLLM and LangChain clients using the same syntax (see the sketch after these notes).
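
As a sketch of the alternatives mentioned in these notes: the first lines log in via the RAGMETRICS_API_KEY environment variable instead of the key parameter, and the last lines monitor a LangChain chat model with the same monitor() call. The choice of ChatOpenAI here is just one example of a LangChain client and is an assumption on our part.

import os
import ragmetrics

# Alternative to passing key= explicitly: set the environment variable first
os.environ["RAGMETRICS_API_KEY"] = "[your_ragmetrics_key]"
ragmetrics.login()

# Monitoring a LangChain chat model uses the same syntax as the OpenAI client
# (ChatOpenAI is used here as an illustrative supported client)
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini")
ragmetrics.monitor(llm)
print(llm.invoke("1+1=?"))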

You can now see the trace on the Traces page

Traces Page


Click the triangle to expand the trace and see the details. Trace Expanded


Step 3: Add metadata

You can add metadata to your traces to help you filter them later:

resp = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "2*2=?"}],
    metadata={"task": "multiplication"}
)

The metadata will show up in the trace details: Second Trace, Multiplication


Now you can use the GUI to filter traces by metadata: Trace Filter

Note:

  • You can also add metadata in ragmetrics.login() (see the sketch after these notes).

  • Metadata from individual traces will supersede metadata from the login call.
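
For example, login-level metadata might look like the sketch below (assuming login() accepts a metadata keyword, as the note above describes); a trace that supplies the same key overrides it.

# Metadata set at login applies to all subsequent traces...
ragmetrics.login(key="[your_ragmetrics_key]", metadata={"environment": "staging"})

# ...unless an individual trace supplies the same key, which takes precedence
resp = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "2*2=?"}],
    metadata={"environment": "production", "task": "multiplication"}
)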


Step 4: Format trace input/output

You can customize the formatting of the trace input and output using a callback. In the example below, we keep only the part of the output after the equals (=) sign.

def my_callback(raw_input, raw_output):
    # Your custom post-processing logic here. For example:
    try:
        raw_output_str = str(raw_output.choices[0].message.content)
        output = raw_output_str.split('=')[1].strip()
    except Exception as e:
        output = raw_output
    return {
         "output": output
    }

ragmetrics.monitor(openai_client, callback=my_callback)

resp = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "2*2=?"}],
    metadata={"task": "multiplication"}
)

Here’s the new trace: Custom Format for Trace


Notes:

  • Callbacks do not erase the original raw input and output. Those are still available in the trace, in addition to the processed input and output.

  • A custom callback accepts two arguments: raw_input and raw_output. These contain the ‘raw’ JSON from the LLM call.

  • The callback must return a dictionary with optional keys input and output (see the sketch after these notes).

  • If no callback is provided, our default callback flattens input messages into a string, and extracts the string content of the last output message.
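
As a sketch, a callback can also reshape the logged input, not just the output. The exact structure of raw_input is an assumption here (a dict of chat-completion arguments), so treat the indexing as illustrative.

def full_callback(raw_input, raw_output):
    # Assumes raw_input carries the chat-completion kwargs, including "messages"
    try:
        logged_input = raw_input["messages"][-1]["content"]
    except (KeyError, IndexError, TypeError):
        logged_input = str(raw_input)
    # Extract just the text of the model's reply
    logged_output = raw_output.choices[0].message.content
    return {"input": logged_input, "output": logged_output}

ragmetrics.monitor(openai_client, callback=full_callback)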


Key Takeaways

In this quickstart tutorial, you learned how to use the RagMetrics push API to:

  • Log a trace and view it on the RagMetrics GUI

  • Add metadata and filter on it

  • Add custom formatting to your trace input and output using a callback


Run an experiment

You can create and run an experiment from code, rather than from the GUI. In RagMetrics, an experiment depends on a task and dataset, so we’ll need to create those first.


Step 1: Create a task

from ragmetrics import Task

t1 = Task(name="Pig Latin", system_prompt="Answer in pig latin", generator_model="gpt-4o-mini")
t1.save()
print(t1.id)

After the task is saved, the task object gets a task ID. You can now see it on the Tasks page:

Pig Latin Task

You can also download a task from RagMetrics using its ID or name:

t2 = Task.download(t1.id)
# t2 = Task.download(name="Pig Latin")
print("Task name:", t2.name)
print("System prompt:", t2.system_prompt)
print("Generator model:", t2.generator_model)

Step 2: Create a dataset

A RagMetrics dataset is a collection of examples. Each example contains a question, a ground truth answer, and a ground truth context.

from ragmetrics import Dataset, Example

# Create two examples
e1 = Example(
    question="What is the biggest city in the east of US?",
    ground_truth_answer="NYC",
    ground_truth_context=[
        "NYC is the biggest city in the east of US.",
        "NYC is on the eastern seaboard."
    ])
e2 = Example(
    question="Is it beautiful",
    ground_truth_answer="Yes",
    ground_truth_context=["NYC is a very beautiful city"]
)

# Create a dataset and save it.
d1 = Dataset(name="NYC", examples = [e1, e2])
d1.save()
print(d1.id)

Just like with tasks, you can download a dataset from RagMetrics using its ID or name:

d2 = Dataset.download(d1.id)
# d2 = Dataset.download(name="NYC")
print("Dataset name:", d2.name)

Step 3: Create an experiment

Now that we have a task and dataset, let’s create the experiment and run it.

A RagMetrics experiment consists of a list of cohorts. Each cohort represents one run through the dataset. Having two cohorts allows us to compare performance between two models, prompts or LLM pipelines, similar to an A/B test. The number of cohorts in an experiment is unlimited.

from ragmetrics import Task, Dataset, Criteria, Cohort, Experiment

# Load pre-requisites: the task, dataset and criteria
t1 = Task.download(name="Pig Latin")
d1 = Dataset.download(name="NYC")

# We'll use the off-the-shelf accuracy criterion, but can also create our own
accuracy = Criteria(name="Accuracy")

# Define the experiment as a list of cohorts
cohort1 = Cohort(name="gpt-4o-mini", generator_model="gpt-4o-mini")
cohort2 = Cohort(name="API Demo: Stub", rag_pipeline="API Demo: Stub")
e1 = Experiment(
    name="My Experiment",
    dataset=d1,
    task=t1,
    cohorts=[cohort1, cohort2],
    criteria=[accuracy],
    judge_model="gpt-4o-mini"
)

# Run the experiment
experiment_results = e1.run()

When the experiment runs, we can see the following in the console: Experiment Progress Bar

Follow the link to view the experiment in the RagMetrics UI.


Key Takeaways

In this quickstart tutorial, you learned how to use the RagMetrics push API to:

  • Create tasks and datasets

  • Create and run experiments

  • Review their results