Using the API

RagMetrics offers two APIs:

  1. Pull API: RagMetrics will pull data from your REST endpoint and evaluate the results. Experiment runs are triggered from the RagMetrics GUI.

  2. Push API: Your local Python code will push data to RagMetrics. Experiment runs are triggered from your Python code.


Pull API

With the Pull API, RagMetrics connects to your code through a REST endpoint, sends input as if from a user, waits for your code to process it, collects the generated output and retrieved contexts, and evaluates the results.

Developer Quickstart Video

Click to watch a video demo


Step 1: Understand the Data Format

  • The RagMetrics API uses JSON for both input and output (see the example request and response after this list).

  • Labeled Data: Your labeled data set should include:

    • Sample inputs or user questions that you would expect for your conversational AI bot.

    • The correct answers that the system should deliver.

    • Correct contexts or sources that are relevant to the answers.

  • The API will take this data, one input at a time, and feed it into your pipeline.

  • Input JSON: The input to your pipeline via the API should include:

    • The question from your labeled data set.

    • Any A/B test parameters that you want to use to switch between different parts of your code, like retrieval strategies. These are optional.

  • Output JSON: Your pipeline should return a JSON object that includes:

    • The generated answer from your LLM. This is the only required field.

    • The retrieved contexts (optional): a list of contexts, each with metadata and content.

    • The LLM response (optional).
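
For concreteness, here is a minimal sketch of the two JSON bodies described above; the field names match the sample pipeline later on this page, and the values are illustrative.

Example input (sent by RagMetrics to your endpoint):

{
    "experiment": {
        "experiment_name": "experiment 1",
        "retrieval_strategy": "embeddings"
    },
    "row": {
        "q_num": 1,
        "question": "What is your international shipping policy?"
    }
}

Example output (returned by your endpoint; only generated_answer is required):

{
    "generated_answer": "Standard international delivery takes 10-15 business days.",
    "contexts": [
        {
            "metadata": {"source": "knowledge_base", "id": 4},
            "page_content": "International shipping is available for most countries..."
        }
    ],
    "llm_response": null
}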


Step 2: Set Up Your Development Environment

  • You can use any environment you like, but this demo uses Replit, a free online coding platform that allows you to set up a web server.

  • Knowledge Base: Create a list of strings that represent your knowledge base. In this demo, the knowledge base is a list of customer service policies copied from an Excel spreadsheet.

  • RAG Method: Create a function (e.g., rag in the demo) that takes a JSON input and returns a JSON output, as defined in Step 1; a minimal skeleton appears after this list, and the full working example is at the end of this page. This method should implement the logic of your LLM pipeline.

  • Parsing Input: In your rag method, parse the input JSON to get:

    • The question (e.g., from the “row” element).

    • Any A/B test parameters (e.g., the “retrieval strategy”).

  • Retrieval Logic: Implement your retrieval logic. In the demo, the retrieval strategy is switched between ‘random’ (retrieving a random context from the knowledge base) and ‘embeddings’ (retrieving a context based on embeddings).

  • Response: Format the output as a JSON object with at least the generated_answer field. You can also add the optional contexts and llm_response fields.

  • Create an Endpoint: Use Replit or a similar platform to set up an endpoint for your LLM pipeline.
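
As a reference, here is a minimal skeleton of such a rag method. It mirrors the structure of the full example at the end of this page; the retrieval and generation steps are placeholders for your own logic.

def rag(json_input):
    # Parse the input JSON defined in Step 1
    question = json_input['row']['question']
    strategy = json_input.get('experiment', {}).get('retrieval_strategy', 'random')

    # Retrieve contexts using the selected strategy
    # (placeholder logic; replace with your own retrieval code)
    if strategy == 'random':
        contexts = ["<a randomly chosen context>"]
    else:
        contexts = ["<the context most similar to the question>"]

    # Generate an answer from the retrieved contexts
    # (placeholder; replace with a call to your LLM)
    answer = f"<answer to: {question}>"

    # Return the output JSON defined in Step 1 (generated_answer is required)
    return {
        "generated_answer": answer,
        "contexts": [{"metadata": {}, "page_content": c} for c in contexts],
    }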


Step 3: Connect Your Endpoint to RagMetrics

  • In the RagMetrics platform, go to the API demo section and create a new “rag endpoint”.

  • Give your endpoint a name (e.g., “my LLM pipeline”) and paste the URL of the endpoint you created in Step 2.

  • Test your endpoint to ensure it’s working correctly by sending a sample input and verifying the output (for example, with the snippet below).
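
One way to send that sample input yourself is a short Python script (a curl command works equally well). The payload reuses the format from Step 1; the URL is a placeholder for the endpoint you created in Step 2.

import requests

sample_input = {
    "experiment": {"experiment_name": "smoke test", "retrieval_strategy": "random"},
    "row": {"q_num": 1, "question": "What is your return policy?"}
}

# Replace with the URL of your own endpoint from Step 2
resp = requests.post("https://your-repl-name.replit.app/", json=sample_input)
print(resp.status_code)
print(resp.json())   # should contain at least "generated_answer"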


Step 4: Run an Experiment

  • Go to the experiments page and choose an existing experiment or create a new one.

  • Select the model pipeline you just created.

  • A/B Test Parameters: Create experimental cohorts to test different configurations of your pipeline.

    • In the demo, two cohorts are created: one using a ‘random’ retrieval strategy and one using an ‘embeddings’ retrieval strategy. These will be used as inputs to the API and then parsed by your code.

  • Evaluation Criteria: Select your evaluation criteria. In the demo, the focus is on “context relevance,” which measures how well the retrieved context matches the ground truth context in your labeled data.

  • Run the experiment. RagMetrics will send each question in your labeled data set to your endpoint with the A/B parameters, and then evaluate the output according to the selected criteria.


Step 5: Analyze the Results

  • After the experiment completes, you will see the results for each experimental cohort.

  • The results will show how well the system performed according to your evaluation criteria.

    • In the demo, the “embeddings” retrieval strategy scored higher than the “random” retrieval strategy on context relevance.

    • For example, when the user asks about a coffee maker, random retrieval might bring back the shipping policy, while embeddings retrieval brings back the correct policy.


Key Takeaways

  • The RagMetrics API allows you to test your LLM application, including the retrieval step, by sending JSON inputs to your own endpoint and evaluating the output.

  • You can use A/B testing parameters to switch between different parts of your code, such as different retrieval strategies.

  • RagMetrics provides an automated way to evaluate the performance of your application based on custom criteria.

  • You can use this to make data-driven decisions about optimizing your entire LLM pipeline.


Example LLM Pipeline Code Snippet (Python)

This code illustrates a minimal RAG pipeline that takes a question, retrieves a context from a knowledge base, and generates an answer. Retrieval is either random or based on embedding similarity, depending on a retrieval-strategy parameter in the input.

You can access this notebook on Colab. However, Colab does not allow incoming HTTP requests, so the demo is deployed on Replit.

from flask import Flask, request, jsonify
from openai import OpenAI
import numpy as np
import random, json

# Initialize the OpenAI client
client = OpenAI(api_key='<Your OpenAI Key>')

# Our knowledge base: a list of sentences
knowledge_base = [
    "Our standard shipping policy states that delivery times are typically within 5-7 business days, but may be extended during periods of high demand. - Company Shipping Policy Document",
    "Our return policy ensures that customers can return incorrect items at no cost, and we will promptly send the correct item. - Company Return Policy Document",
    "All appliances purchased from our store come with a 12-month warranty covering manufacturing defects and malfunctions. - Company Warranty Policy Document",
    "Returns without a receipt can be processed for store credit or full refund if the purchase is linked to a customer account. - Company Return Policy Document",
    "International shipping is available for most countries, with standard delivery times ranging from 10-15 business days, subject to customs clearance. - Company International Shipping Policy Document",
]

def get_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=text
    )
    return response.data[0].embedding


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def retrieval_random(query, knowledge_base):
    return [random.choice(knowledge_base)]


def retrieval_embed(query, knowledge_base):
    query_embedding = get_embedding(query)

    similarities = []
    for sentence in knowledge_base:
        sentence_embedding = get_embedding(sentence)
        similarity = cosine_similarity(query_embedding, sentence_embedding)
        similarities.append(similarity)

    most_relevant_index = np.argmax(similarities)
    relevant_sentences = [knowledge_base[most_relevant_index]]

    return relevant_sentences


def generation(context, query):
    prompt = f"""You provide customer service for an online retailer. Answer the customer question based on the context only.

    Question: {query}

    Context: {' '.join(context)}

    Answer:"""

    llm_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": prompt}
        ],
        temperature=0.7
    )

    answer = llm_response.choices[0].message.content

    return answer, llm_response


def simple_rag(json_input):
    # Parse input parameters
    try:
        user_question = json_input['row']['question']
    except (KeyError, TypeError):
        return {"error": "Missing json_input['row']['question']"}
    retrieval_strategy = json_input.get('experiment', {}).get('retrieval_strategy', 'random')

    # Run pipeline
    if retrieval_strategy == 'random':
        contexts = retrieval_random(user_question, knowledge_base)
    elif retrieval_strategy == 'embeddings':
        contexts = retrieval_embed(user_question, knowledge_base)
    else:
        return {"error": "Invalid json_input['experiment']['retrieval_strategy']"}

    answer, llm_response = generation(contexts, user_question)

    # Write API response
    output_json = {
        "generated_answer": answer,
        "contexts": [
            {
                "metadata": {
                    "source": "knowledge_base",
                    "id": knowledge_base.index(context)
                },
                "page_content": context
            } for context in contexts
        ],
        "llm_response": llm_response.model_dump() if llm_response else None
    }
    return output_json


# Flask app setup
app = Flask(__name__)


@app.after_request
def after_request(response):
    origin = request.headers.get('Origin')
    if origin:
        response.headers['Access-Control-Allow-Origin'] = origin
    response.headers['Access-Control-Allow-Methods'] = 'GET, POST, OPTIONS'
    response.headers['Access-Control-Allow-Headers'] = 'Content-Type, X-CSRFToken'
    response.headers['Access-Control-Allow-Credentials'] = 'true'
    return response


@app.route('/', methods=['OPTIONS'])
def options():
    response = jsonify({})
    response.headers['Access-Control-Max-Age'] = '86400'
    return response


@app.route('/', methods=['POST'])
def post_handler():
    post_data = request.get_json()
    response_data = simple_rag(post_data)
    return jsonify(response_data)


if __name__ == '__main__':
    # To test simple_rag locally without the web server, uncomment these lines:
    # sample_input = {
    #     "experiment": {
    #         "experiment_name": "experiment 1",
    #         "retrieval_strategy": "random"
    #     },
    #     "row": {
    #         "q_num": 1,
    #         "question": "international"
    #     }
    # }
    # result = simple_rag(sample_input)
    # print(json.dumps(result, indent=4))

    app.run(host='0.0.0.0', port=8000)

Next Steps

  • Try implementing different retrieval strategies in your code.

  • Experiment with different evaluation criteria to measure other aspects of your LLM application.

  • Explore the full range of RagMetrics features for more advanced evaluation and optimization.

  • Drop RagMetrics a line to try out the feature.

This quickstart should give you a clear understanding of how to use the RagMetrics API to improve your LLM applications.



Push API

This API allows you to use your local Python code to push data to RagMetrics, trigger experiments, upload datasets, create criteria and more.

Getting Started: Log a trace

Step 1: Get a RagMetrics key

  • Log into RagMetrics

  • Navigate to the Keys page

  • Create a new RagMetrics API key

  • Copy the key to your clipboard

Keys Page


Step 2: Log a trace

Use Python code such as the following to log a trace of your LLM input and output to RagMetrics:

#!pip install ragmetrics-client openai

import ragmetrics
from openai import OpenAI
ragmetrics.login(key="[your_ragmetrics_key]")

openai_client = OpenAI(api_key="[your_openai_key]")
ragmetrics.monitor(openai_client)
resp = openai_client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "1+1=?"}]
)
print(resp)

Notes:

  • Paste your RagMetrics API key into the key parameter of the ragmetrics.login() method or set os.environ["RAGMETRICS_API_KEY"] before calling login().

  • Call monitor() once and then subsequent LLM calls from this client will be logged.

  • The code above demonstrates how to monitor the OpenAI client, but we also support LiteLLM and LangChain clients using the same syntax (see the sketch after these notes).
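
As a sketch of the alternatives mentioned in these notes: the first lines log in via the RAGMETRICS_API_KEY environment variable instead of the key parameter, and the last lines monitor a LangChain chat model with the same monitor() call. The choice of ChatOpenAI here is just one example of a LangChain client and is an assumption on our part.

import os
import ragmetrics

# Alternative to passing key= explicitly: set the environment variable first
os.environ["RAGMETRICS_API_KEY"] = "[your_ragmetrics_key]"
ragmetrics.login()

# Monitoring a LangChain chat model uses the same syntax as the OpenAI client
# (ChatOpenAI is used here as an illustrative supported client)
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini")
ragmetrics.monitor(llm)
print(llm.invoke("1+1=?"))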

You can now see the trace on the Traces page

Traces Page


Click the triangle to expand the trace and see the details. Trace Expanded


Step 3: Add metadata

You can add metadata to your traces to help you filter them later:

resp = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "2*2=?"}],
    metadata={"task": "multiplication"}
)

The metadata will show up in the trace details: Second Trace, Multiplication


Now you can use the GUI to filter traces by metadata: Trace Filter

Note:

  • You can also add metadata in ragmetrics.login() (see the sketch after these notes).

  • Metadata from individual traces will supersede metadata from the login call.
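
For example, login-level metadata might look like the sketch below (assuming login() accepts a metadata keyword, as the note above describes); a trace that supplies the same key overrides it.

# Metadata set at login applies to all subsequent traces...
ragmetrics.login(key="[your_ragmetrics_key]", metadata={"environment": "staging"})

# ...unless an individual trace supplies the same key, which takes precedence
resp = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "2*2=?"}],
    metadata={"environment": "production", "task": "multiplication"}
)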


Step 4: Format trace input/output

You can customize the formatting of the trace input and output using a callback. In the example below, we keep only the part of the output after the equals (=) sign.

def my_callback(raw_input, raw_output):
    # Your custom post-processing logic here. For example:
    try:
        raw_output_str = str(raw_output.choices[0].message.content)
        output = raw_output_str.split('=')[1].strip()
    except Exception as e:
        output = raw_output
    return {
         "output": output
    }

ragmetrics.monitor(openai_client, callback=my_callback)

resp = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "2*2=?"}],
    metadata={"task": "multiplication"}
)

Here’s the new trace: Custom Format for Trace


Notes:

  • Callbacks do not erase the original raw input and output. Those are still available in the trace, in addition to the processed input and output.

  • A custom callback accepts two arguments: raw_input and raw_output. These contain the ‘raw’ JSON from the LLM call.

  • The callback must return a dictionary with optional keys input and output (see the sketch after these notes).

  • If no callback is provided, our default callback flattens input messages into a string, and extracts the string content of the last output message.
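
As a sketch, a callback can also reshape the logged input, not just the output. The exact structure of raw_input is an assumption here (a dict of chat-completion arguments), so treat the indexing as illustrative.

def full_callback(raw_input, raw_output):
    # Assumes raw_input carries the chat-completion kwargs, including "messages"
    try:
        logged_input = raw_input["messages"][-1]["content"]
    except (KeyError, IndexError, TypeError):
        logged_input = str(raw_input)
    # Extract just the text of the model's reply
    logged_output = raw_output.choices[0].message.content
    return {"input": logged_input, "output": logged_output}

ragmetrics.monitor(openai_client, callback=full_callback)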


Key Takeaways

In this quickstart tutorial, you learned how to use the RagMetrics push API to:

  • Log a trace and view it on the RagMetrics GUI

  • Add metadata and filter on it

  • Add custom formatting to your trace input and output using a callback


Run an experiment

You can create and run an experiment from code, rather than from the GUI. In RagMetrics, an experiment depends on a task and dataset, so we’ll need to create those first.


Step 1: Create a task

from ragmetrics import Task

t1 = Task(name="Pig Latin", system_prompt="Answer in pig latin", generator_model="gpt-4o-mini")
t1.save()
print(t1.id)

After the task is saved, the task object gets a task ID. You can now see it on the Tasks page:

Pig Latin Task

You can also download a task from RagMetrics using its ID or name:

t2 = Task.download(t1.id)
# t2 = Task.download(name="Pig Latin")
print("Task name:", t2.name)
print("System prompt:", t2.system_prompt)
print("Generator model:", t2.generator_model)

Step 2: Create a dataset

A RagMetrics dataset is a collection of examples. Each example contains a question, a ground truth answer, and a ground truth context.

from ragmetrics import Dataset, Example

# Create two examples
e1 = Example(
    question="What is the biggest city in the east of US?",
    ground_truth_answer="NYC",
    ground_truth_context=[
        "NYC is the biggest city in the east of US.",
        "NYC is on the eastern seaboard."
    ])
e2 = Example(
    question="Is it beautiful",
    ground_truth_answer="Yes",
    ground_truth_context=["NYC is a very beautiful city"]
)

# Create a dataset and save it.
d1 = Dataset(name="NYC", examples = [e1, e2])
d1.save()
print(d1.id)

Just like with tasks, you can download a dataset from RagMetrics using its ID or name:

d2 = Dataset.download(d1.id)
# d2 = Dataset.download(name="NYC")
print("Dataset name:", d2.name)

Step 3: Create an experiment

Now that we have a task and dataset, let’s create the experiment and run it.

A RagMetrics experiment consists of a list of cohorts. Each cohort represents one run through the dataset. Having two cohorts allows us to compare performance between two models, prompts or LLM pipelines, similar to an A/B test. The number of cohorts in an experiment is unlimited.

from ragmetrics import Task, Dataset, Criteria, Cohort, Experiment

# Load pre-requisites: the task, dataset and criteria
t1 = Task.download(name="Pig Latin")
d1 = Dataset.download(name="NYC")

# We'll use the off-the-shelf accuracy criterion, but can also create our own
accuracy = Criteria(name="Accuracy")

# Define the experiment as a list of cohorts
cohort1 = Cohort(name="gpt-4o-mini", generator_model="gpt-4o-mini")
cohort2 = Cohort(name="API Demo: Stub", rag_pipeline="API Demo: Stub")
e1 = Experiment(
    name="My Experiment",
    dataset=d1,
    task=t1,
    cohorts=[cohort1, cohort2],
    criteria=[accuracy],
    judge_model="gpt-4o-mini"
)

# Run the experiment
experiment_results = e1.run()

When the experiment runs, we can see the following in the console: Experiment Progress Bar

Follow the link to view the experiment in the RagMetrics UI.


Key Takeaways

In this quickstart tutorial, you learned how to use the RagMetrics push API to:

  • Create tasks and datasets

  • Create and run experiments

  • Review their results