Using the GUI

Quickstart: Run an experiment

This tutorial will guide you through setting up your first experiment on the RagMetrics platform using the web interface, with no code. You’ll learn how to connect your LLM service, create a labeled dataset, and evaluate your model, all in a few simple steps.


Developer Quickstart Video




Step 1: Connect Your LLM Service

  • Begin by linking RagMetrics to your LLM provider. For this tutorial, we’ll use OpenAI.

    • Click on Keys, then add a key.

    • Click on OpenAI.

    • Provide your OpenAI API key. You can create one from your OpenAI account.


Step 2: Define Your Task

  • Next, specify the task you’d like your LLM to perform.

    • Use pre-configured integrations for GPT-4, Hugging Face, or other LLM providers.

    • Input your API key or custom model endpoint.

  • In this example, we will create a New York City Tour Guide.

    • This means your LLM will answer questions about New York City.


Step 3: Create a Labeled Dataset

  • A labeled dataset is essential for evaluating your LLM’s performance.

  • You can create a dataset by uploading a spreadsheet (a sketch of a typical row layout appears after the example below), but for this tutorial, we will generate one from reference documents.

  • We’ll load reference documents from New York City’s Wikipedia page.

  • RagMetrics will automatically:

    1. Read the documents.

    2. Chunk the documents.

    3. Parse the documents.

    4. Embed the documents.

    5. Generate 10 questions based on the content.

    6. Generate 10 corresponding answers.

    7. Identify chunks of context within the documents that back up each answer.

  • Here’s an example:

    • Question: “How has the location of New York City at the mouth of the Hudson River contributed to its growth as a trading post?”

    • Ground Truth Answer: “The location of New York City at the mouth of the Hudson River has contributed to its growth as a trading port because it provides a naturally sheltered harbor and easy access to the Atlantic Ocean”.

    • Ground Truth Context: A specific excerpt from the Wikipedia page that contains the ground truth answer.

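For reference, here is a minimal sketch of how such rows could be laid out if you chose the spreadsheet route instead. The column names are assumptions for illustration only; check the dataset upload page for the exact headers RagMetrics expects.

    # Illustrative only: the column names are assumptions, not the exact
    # headers RagMetrics expects for spreadsheet upload.
    import csv

    rows = [
        {
            "question": "How has the location of New York City at the mouth of the "
                        "Hudson River contributed to its growth as a trading post?",
            "ground_truth_answer": "It provides a naturally sheltered harbor and easy "
                                   "access to the Atlantic Ocean.",
            "ground_truth_context": "<excerpt from the Wikipedia page that contains the answer>",
        },
    ]

    with open("nyc_tour_guide_dataset.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)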

Step 4: Evaluate Your Model

  • Now it’s time to evaluate how well your model performs on the created dataset.

  • Let’s name the evaluation “Accuracy”.

  • Select the evaluation criteria (a rough sketch of how the retrieval metrics can be computed appears after this step):

    • Context Precision: Measures how often the retrieval model provides the correct context.

    • Gold Rank: Measures how high the correct context is ranked in the retrieved results.

    • Answer Quality: A score of 1 to 3, judged by GPT-4.

  • Run the evaluation. RagMetrics will:

    • Generate answers to each of the 10 questions in your dataset.

    • Compare the generated answers to the gold answers.

    • Assess the context retrieval to see if the correct contexts were injected into the prompt.

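RagMetrics computes these scores for you, and the exact formulas are not spelled out here. As a rough mental model, the sketch below shows one common way the two retrieval metrics can be computed, assuming each question has a known gold context and a ranked list of retrieved chunks; it is an illustration, not RagMetrics' implementation.

    # Illustrative only: a plausible way to compute the two retrieval metrics
    # described above. RagMetrics' exact definitions may differ.

    def gold_rank(retrieved_chunks, gold_context):
        """1-based rank of the gold context in the retrieved list, or None if it is absent."""
        for rank, chunk in enumerate(retrieved_chunks, start=1):
            if gold_context in chunk:
                return rank
        return None

    def context_precision(all_retrievals, gold_contexts):
        """Fraction of questions whose gold context appears anywhere in the retrieved chunks."""
        hits = sum(
            1
            for retrieved, gold in zip(all_retrievals, gold_contexts)
            if gold_rank(retrieved, gold) is not None
        )
        return hits / len(gold_contexts)

    # The gold context is retrieved at rank 2 for question 1, and missed for question 2.
    retrievals = [["chunk about bridges", "excerpt about the sheltered harbor"],
                  ["chunk about museums", "chunk about parks"]]
    golds = ["excerpt about the sheltered harbor", "excerpt about the subway"]
    print(gold_rank(retrievals[0], golds[0]))    # 2
    print(context_precision(retrievals, golds))  # 0.5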

Step 5: Analyze the Results

  • Once the evaluation is complete, you’ll see an overview of the results.

  • The results will show:

    • Context precision and gold rank scores, indicating how well the system retrieved the correct context.

    • Answer quality score, indicating how well the model answered the questions.

    • The cost of the evaluation.

    • Downloadable details for a question-by-question breakdown.

  • In this example, the answer quality score is 2.4 out of 3 (see the note after this list).

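A note on reading the 2.4: it is consistent with averaging per-question scores. For example, if six questions scored 3, two scored 2, and two scored 1, the average would be (6×3 + 2×2 + 2×1) / 10 = 2.4. The actual per-question scores are in the downloadable details.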

Step 6: Improve Your Model (Optional)

  • If you’re not satisfied with the results, you can make changes and re-evaluate.

  • For example, you could switch to a different model. Let’s switch from GPT-3.5 Turbo to GPT-4.

  • You can then re-run the evaluation and see if there is any improvement.

LLM Configurations

  • After the new evaluation, the context precision and gold rank remain the same, but the answer quality increases from 2.4 to 2.7.


Key Takeaways

  • RagMetrics provides an easy way to test your LLM applications using labeled data.

  • The platform automates the evaluation process and provides detailed insights that help you make data-driven decisions to improve your application.

  • You can test different models and prompts.

  • You can evaluate the quality of context retrieval.

Next Steps

  • Try other evaluation criteria.

  • Try testing your own data and prompts.

  • Book a demo with us to learn more.

This tutorial should provide you with a solid starting point for using RagMetrics to improve the quality of your LLM applications.



Monitor your LLM application

After you deploy your LLM application to production, use RagMetrics to monitor how your LLM pipeline is responding to your users:



Step 1: Log traces

First, add two lines to your Python code to log traces to RagMetrics, as described in “Getting Started: Log a trace” in the API documentation. A rough sketch of those lines follows.

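Here is what those two lines might look like. The package name and the login() and monitor() helpers below are assumptions made for illustration; follow the API documentation for the exact calls and signatures.

    # Sketch only: the package name, login(), and monitor() are assumptions.
    # Follow "Getting Started: Log a trace" in the API documentation for the
    # exact calls and signatures.
    import openai
    import ragmetrics  # hypothetical import name

    # Line 1: authenticate with your RagMetrics key (hypothetical signature).
    ragmetrics.login(key="YOUR_RAGMETRICS_KEY")

    # Line 2: wrap the LLM client so every call it makes is logged as a trace
    # (hypothetical helper).
    client = openai.OpenAI()
    ragmetrics.monitor(client)

    # From here on, normal calls through the client are traced automatically.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Which borough is Central Park in?"}],
    )
    print(response.choices[0].message.content)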


Step 2: Create a review queue

In RagMetrics, a review queue is used for evaluating incoming traces both manually (human review) and automatically by LLM judges (online evaluation).


Review Queues, Empty


  • Navigate to “Reviews”

  • Click “Create Queue”

  • Fill out the form. Here are the key fields:

    • Condition: A search string. Incoming traces that contain this string in the input, output, or metadata will be included in this review queue (see the matching sketch after this list).

    • Criteria: Traces included in this queue will be evaluated according to the criteria you select here.

    • Judge Model: The LLM that will judge the traces.

    • Dataset: For any trace in the queue, you can correct the actual output into the expected output and store it in a dataset. That dataset can then be used as a regression test to evaluate your pipeline as you add new features.

  • Click “Save”

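For clarity, the sketch below models the documented Condition behavior: a trace joins the queue when the condition string appears in its input, output, or metadata. This is an illustration only, not RagMetrics code.

    # Illustration only (not RagMetrics code): a trace is included in a queue
    # when the queue's condition string appears in the trace's input, output,
    # or metadata, as described above.

    def matches_queue(trace: dict, condition: str) -> bool:
        fields = [trace.get("input", ""), trace.get("output", ""), trace.get("metadata", {})]
        return any(condition in str(field) for field in fields)

    trace = {
        "input": "What is the population of Brooklyn?",
        "output": "About 2.6 million people.",
        "metadata": {"app": "nyc-tour-guide"},
    }
    print(matches_queue(trace, "nyc-tour-guide"))  # True
    print(matches_queue(trace, "weather-bot"))     # False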


Step 3: Review a trace

Now that you have created a review queue, all traces that match the condition will be reviewed automatically, according to the criteria you selected. Let’s watch the queue in action:


  • Log a new trace that matches the search condition.

  • Click on the queue. You should now see the trace there.

  • Click on the trace.

Trace in Review Queue


This is the trace review panel. Human and automatic reviews are shown in the “Scores” panel. The trace is shown in the “Input” and “Output” panels below.

As soon as the trace is logged, it will be automatically evaluated according to the criteria selected in the review queue. If the trace matches more than one queue, all criteria from all queues will be applied. Automated reviews can take a few minutes to complete. As soon as they are complete, the scores will be shown in the “Scores” panel.

Click “Review” to add a (human) review to the trace.


Human Review


You can review the trace overall (pass/fail) or give feedback on one of the automated scores. This feedback helps the LLM judge improve its automated judgments over time.



Step 4: Set Expected Output

On the trace review page, click “Expected Output” to correct the output into the expected output and store it in a dataset. This dataset can then be used as a regression test to evaluate your pipeline as you add new features. The dataset selector is at the bottom left of the Expected Output modal; it defaults to the dataset you selected when you created the review queue.


Expected Output


Note: You will be able to edit the expected output further on the dataset page.



Fix hallucinations

RagMetrics can help you discover and fix hallucinations in your LLM application. We define hallucinations as inconsistencies between the input and output. This feature works best for data extraction and parsing use cases, where the output is generated by parsing the input. To use this feature:

  1. Create a review queue

  2. In the criteria section, expand “Generation Criteria”

  3. Select “Hallucination Detector”

  4. For the Judge Model, we recommend a large reasoning model, such as OpenAI’s o3-mini or DeepSeek’s deepseek-reasoner (make sure you have configured your OpenAI or DeepSeek key in the “Keys” section)

  5. Save the review queue

  6. Log a trace to the queue (an example trace is sketched after this list)

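As an illustration of the kind of trace to log in step 6, here is a hypothetical data-extraction trace whose parsed output contradicts its input; the field names and values are invented for this example.

    # Hypothetical trace content (not an API call). The parsed "total" does not
    # match the input document, which is the kind of input/output inconsistency
    # the hallucination detector is designed to flag.
    trace = {
        "input": "Invoice #1042\nSubtotal: $1,200.00\nTax: $96.00\nTotal: $1,296.00",
        "output": {
            "invoice_number": 1042,
            "subtotal": 1200.00,
            "tax": 96.00,
            "total": 1396.00,  # hallucinated: the input says 1,296.00
        },
    }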

The hallucination detector shows a list of possible hallucinations. Click on a hallucination to see the relevant highlights from the input and output. Here’s an example:


Hallucination Detector


In this case, we can click “Expected Output” and fix the result from 7,056,652 to 7,006,652. We can then store this in the “multiplication” dataset, which can be used to test whether our LLM app is good at multiplication.