Using the GUI

Quickstart: Run an experiment

This tutorial will guide you through setting up your first experiment on the RagMetrics platform using the web interface, with no code. You’ll learn how to connect your LLM service, create a labeled dataset, and evaluate your model, all in a few simple steps.


Developer Quickstart Video




Step 1: Connect Your LLM Service

  • Begin by linking RagMetrics to your LLM provider. For this tutorial, we’ll use OpenAI.

    • Click on Keys, then add a key.

    • Click on OpenAI.

    • Provide your OpenAI API key. You can create one from your OpenAI account.


Step 2: Define Your Task

  • Next, specify the task you’d like your LLM to perform.

    • Use pre-configured integrations for GPT-4, Hugging Face, or other LLM providers.

    • Input your API key or custom model endpoint.

  • In this example, we will create a New York City Tour Guide.

    • This means your LLM will answer questions about New York City.


Step 3: Create a Labeled Dataset

  • A labeled dataset is essential for evaluating your LLM’s performance.

  • You can create a dataset by uploading a spreadsheet (a sketch of a typical row layout appears after the example below), but for this tutorial, we will generate one from reference documents.

  • We’ll load reference documents from New York City’s Wikipedia page.

  • RagMetrics will automatically:

    1. Read the documents.

    2. Chunk the documents.

    3. Parse the documents.

    4. Embed the documents.

    5. Generate 10 questions based on the content.

    6. Generate 10 corresponding answers.

    7. Identify chunks of context within the documents that back up each answer.

  • Here’s an example:

    • Question: “How has the location of New York City at the mouth of the Hudson River contributed to its growth as a trading post?”

    • Ground Truth Answer: “The location of New York City at the mouth of the Hudson River has contributed to its growth as a trading port because it provides a naturally sheltered harbor and easy access to the Atlantic Ocean”.

    • Ground Truth Context: A specific excerpt from the Wikipedia page that contains the ground truth answer.

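For reference, here is a minimal sketch of how such rows could be laid out if you chose the spreadsheet route instead. The column names are assumptions for illustration only; check the dataset upload page for the exact headers RagMetrics expects.

    # Illustrative only: the column names are assumptions, not the exact
    # headers RagMetrics expects for spreadsheet upload.
    import csv

    rows = [
        {
            "question": "How has the location of New York City at the mouth of the "
                        "Hudson River contributed to its growth as a trading post?",
            "ground_truth_answer": "It provides a naturally sheltered harbor and easy "
                                   "access to the Atlantic Ocean.",
            "ground_truth_context": "<excerpt from the Wikipedia page that contains the answer>",
        },
    ]

    with open("nyc_tour_guide_dataset.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)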

Step 4: Evaluate Your Model

  • Now it’s time to evaluate how well your model performs on the created dataset.

  • Let’s name the evaluation “Accuracy”.

  • Select the evaluation criteria (a rough sketch of how the retrieval metrics can be computed appears after this step):

    • Context Precision: Measures how often the retrieval model provides the correct context.

    • Gold Rank: Measures how high the correct context is ranked in the retrieved results.

    • Answer Quality: A score of 1 to 3, judged by GPT-4.

  • Run the evaluation. RagMetrics will:

    • Generate answers to each of the 10 questions in your dataset.

    • Compare the generated answers to the gold answers.

    • Assess the context retrieval to see if the correct contexts were injected into the prompt.

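RagMetrics computes these scores for you, and the exact formulas are not spelled out here. As a rough mental model, the sketch below shows one common way the two retrieval metrics can be computed, assuming each question has a known gold context and a ranked list of retrieved chunks; it is an illustration, not RagMetrics' implementation.

    # Illustrative only: a plausible way to compute the two retrieval metrics
    # described above. RagMetrics' exact definitions may differ.

    def gold_rank(retrieved_chunks, gold_context):
        """1-based rank of the gold context in the retrieved list, or None if it is absent."""
        for rank, chunk in enumerate(retrieved_chunks, start=1):
            if gold_context in chunk:
                return rank
        return None

    def context_precision(all_retrievals, gold_contexts):
        """Fraction of questions whose gold context appears anywhere in the retrieved chunks."""
        hits = sum(
            1
            for retrieved, gold in zip(all_retrievals, gold_contexts)
            if gold_rank(retrieved, gold) is not None
        )
        return hits / len(gold_contexts)

    # The gold context is retrieved at rank 2 for question 1, and missed for question 2.
    retrievals = [["chunk about bridges", "excerpt about the sheltered harbor"],
                  ["chunk about museums", "chunk about parks"]]
    golds = ["excerpt about the sheltered harbor", "excerpt about the subway"]
    print(gold_rank(retrievals[0], golds[0]))    # 2
    print(context_precision(retrievals, golds))  # 0.5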

Step 5: Analyze the Results

  • Once the evaluation is complete, you’ll see an overview of the results.

  • The results will show:

    • Context precision and gold rank scores, indicating how well the system retrieved the correct context.

    • Answer quality score, indicating how well the model answered the questions.

    • The cost of the evaluation.

    • Downloadable details for a question-by-question breakdown.

  • In this example, the answer quality score is 2.4 out of 3 (see the note after this list).

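A note on reading the 2.4: it is consistent with averaging per-question scores. For example, if six questions scored 3, two scored 2, and two scored 1, the average would be (6×3 + 2×2 + 2×1) / 10 = 2.4. The actual per-question scores are in the downloadable details.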

Step 6: Improve Your Model (Optional)

  • If you’re not satisfied with the results, you can make changes and re-evaluate.

  • For example, you could switch to a different model. Let’s switch from GPT-3.5 Turbo to GPT-4.

  • You can then re-run the evaluation and see if there is any improvement.

LLM Configurations

  • After the new evaluation, the context precision and gold rank remain the same, but the answer quality increases from 2.4 to 2.7.


Key Takeaways

  • RagMetrics provides an easy way to test your LLM applications using labeled data.

  • The platform automates the evaluation process and provides detailed insights that help you make data-driven decisions to improve your application.

  • You can test different models and prompts.

  • You can evaluate the quality of context retrieval.

Next Steps

  • Try other evaluation criteria.

  • Try testing your own data and prompts.

  • Book a demo with us to learn more.

This tutorial should provide you with a solid starting point for using RagMetrics to improve the quality of your LLM applications.



Monitor your LLM application

After you deploy your LLM application to production, use RagMetrics to monitor how your LLM pipeline is responding to your users:



Step 1: Log traces

First, add two lines to your Python code to log traces to RagMetrics, as described in “Getting Started: Log a trace” in the API documentation. A rough sketch of those lines follows.

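Here is what those two lines might look like. The package name and the login() and monitor() helpers below are assumptions made for illustration; follow the API documentation for the exact calls and signatures.

    # Sketch only: the package name, login(), and monitor() are assumptions.
    # Follow "Getting Started: Log a trace" in the API documentation for the
    # exact calls and signatures.
    import openai
    import ragmetrics  # hypothetical import name

    # Line 1: authenticate with your RagMetrics key (hypothetical signature).
    ragmetrics.login(key="YOUR_RAGMETRICS_KEY")

    # Line 2: wrap the LLM client so every call it makes is logged as a trace
    # (hypothetical helper).
    client = openai.OpenAI()
    ragmetrics.monitor(client)

    # From here on, normal calls through the client are traced automatically.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Which borough is Central Park in?"}],
    )
    print(response.choices[0].message.content)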


Step 2: Create a review queue

In RagMetrics, a review queue is used for evaluating incoming traces both manually (human review) and automatically by LLM judges (online evaluation).


Review Queues, Empty


  • Navigate to “Reviews”

  • Click “Create Queue”

  • Fill out the form. Here are the key fields:

    • Condition: A search string. Incoming traces that contain this string in the input, output, or metadata will be included in this review queue (see the matching sketch after this list).

    • Criteria: Traces included in this queue will be evaluated according to the criteria you select here.

    • Judge Model: The LLM that will judge the traces.

    • Dataset: For any trace in the queue, you can correct the actual output into the expected output and store it in a dataset. That dataset can then be used as a regression test to evaluate your pipeline as you add new features.

  • Click “Save”

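For clarity, the sketch below models the documented Condition behavior: a trace joins the queue when the condition string appears in its input, output, or metadata. This is an illustration only, not RagMetrics code.

    # Illustration only (not RagMetrics code): a trace is included in a queue
    # when the queue's condition string appears in the trace's input, output,
    # or metadata, as described above.

    def matches_queue(trace: dict, condition: str) -> bool:
        fields = [trace.get("input", ""), trace.get("output", ""), trace.get("metadata", {})]
        return any(condition in str(field) for field in fields)

    trace = {
        "input": "What is the population of Brooklyn?",
        "output": "About 2.6 million people.",
        "metadata": {"app": "nyc-tour-guide"},
    }
    print(matches_queue(trace, "nyc-tour-guide"))  # True
    print(matches_queue(trace, "weather-bot"))     # False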


Step 3: Review a trace

Now that you have created a review queue, all traces that match the condition will be reviewed automatically, according to the criteria you selected. Let’s watch the queue in action:


  • Log a new trace that matches the search condition.

  • Click on the queue. You should now see the trace there.

  • Click on the trace.

Trace in Review Queue


This is the trace review panel. Human and automatic reviews are shown in the “Scores” panel. The trace is shown in the “Input” and “Output” panels below.

As soon as the trace is logged, it will be automatically evaluated according to the criteria selected in the review queue. If the trace matches more than one queue, all criteria from all queues will be applied. Automated reviews can take a few minutes to complete. As soon as they are complete, the scores will be shown in the “Scores” panel.

Click “Review” to add a (human) review to the trace.


Human Review


You can review the trace overall (pass/fail) or give feedback on one of the automated scores. This feedback helps the LLM judge improve its automated judgments over time.



Step 4: Set Expected Output

On the trace review page, click “Expected Output” to correct the output into the expected output and store it in a dataset. This dataset can then be used as a regression test to evaluate your pipeline as you add new features. The dataset selector is at the bottom left of the Expected Output modal; it defaults to the dataset you selected when you created the review queue.


Expected Output


Note: You will be able to edit the expected output further on the dataset page.



Fix hallucinations

RagMetrics can help you discover and fix hallucinations in your LLM application. We define hallucinations as inconsistencies between the input and output. This feature works best for data extraction and parsing use cases, where the output is generated by parsing the input. To use this feature:

  1. Create a review queue

  2. In the criteria section, expand “Generation Criteria”

  3. Select “Hallucination Detector”

  4. For the Judge Model, we recommend a large reasoning model, such as OpenAI’s o3-mini or DeepSeek’s deepseek-reasoner (make sure you have configured your OpenAI or DeepSeek key in the “Keys” section)

  5. Save the review queue

  6. Log a trace to the queue (an example trace is sketched after this list)

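As an illustration of the kind of trace to log in step 6, here is a hypothetical data-extraction trace whose parsed output contradicts its input; the field names and values are invented for this example.

    # Hypothetical trace content (not an API call). The parsed "total" does not
    # match the input document, which is the kind of input/output inconsistency
    # the hallucination detector is designed to flag.
    trace = {
        "input": "Invoice #1042\nSubtotal: $1,200.00\nTax: $96.00\nTotal: $1,296.00",
        "output": {
            "invoice_number": 1042,
            "subtotal": 1200.00,
            "tax": 96.00,
            "total": 1396.00,  # hallucinated: the input says 1,296.00
        },
    }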

The hallucination detector shows a list of possible hallucinations. Click on a hallucination to see the relevant highlights from the input and output. Here’s an example:


Hallucination Detector


In this case, we can click “Expected Output” and fix the result from 7,056,652 to 7,006,652. We can then store this in the “multiplication” dataset, which can be used to test whether our LLM app is good at multiplication.