Using the GUI
Quickstart: Run an experiment
This tutorial will guide you through setting up your first experiment on the RagMetrics platform using the web interface, with no code. You’ll learn how to connect your LLM service, create a labeled dataset, and evaluate your model, all in a few simple steps.
Step 1: Connect Your LLM Service
Begin by linking RagMetrics to your LLM provider. For this tutorial, we’ll use OpenAI.
Click on Keys, then add a key.
Click on OpenAI.
Provide your OpenAI key. Here’s how to get one.
Step 2: Define Your Task
Next, specify the task you’d like your LLM to perform.
Use pre-configured integrations for GPT-4, Hugging Face, or other LLM providers.
Input your API key or custom model endpoint.
In this example, we will create a New York City Tour Guide.
This means your LLM will answer questions about New York City.
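In code terms, a task like this boils down to a system prompt plus a model call. The sketch below is only an illustration of what the configured task amounts to; the prompt wording and model name are assumptions, not values RagMetrics requires.

```python
# Illustrative only: what the "New York City Tour Guide" task amounts to.
# The system prompt and model name are assumptions for this sketch, not
# settings RagMetrics requires.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful New York City tour guide."},
        {"role": "user", "content": "What is the best way to get from JFK to Manhattan?"},
    ],
)
print(response.choices[0].message.content)
```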
Step 3: Create a Labeled Dataset
A labeled dataset is essential for evaluating your LLM’s performance.
You can create a dataset by uploading a spreadsheet, but for this tutorial, we will generate one from reference documents.
We’ll load reference documents from New York City’s Wikipedia page.
RagMetrics will automatically:
Read, chunk, parse, and embed the documents.
Generate 10 questions based on the content, along with corresponding ground truth answers.
Identify the chunks of context within the documents that support each answer.
Here’s an example:
Question: “How has the location of New York City at the mouth of the Hudson River contributed to its growth as a trading post?”
Ground Truth Answer: “The location of New York City at the mouth of the Hudson River has contributed to its growth as a trading port because it provides a naturally sheltered harbor and easy access to the Atlantic Ocean”.
Ground Truth Context: A specific excerpt from the Wikipedia page that contains the ground truth answer.
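If you choose the spreadsheet route instead, each row of the labeled dataset pairs a question with its ground truth answer and context. The field names below are illustrative assumptions; check the upload screen for the exact headers RagMetrics expects.

```python
# One labeled dataset row in the shape described above.
# The field names are illustrative assumptions, not a required schema.
example_row = {
    "question": "How has the location of New York City at the mouth of the "
                "Hudson River contributed to its growth as a trading post?",
    "ground_truth_answer": "It provides a naturally sheltered harbor and easy "
                           "access to the Atlantic Ocean.",
    "ground_truth_context": "<excerpt from the New York City Wikipedia page>",
}
```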
Step 4: Evaluate Your Model
Now it’s time to evaluate how well your model performs on the created dataset.
Let’s name the evaluation “Accuracy”.
Select the evaluation criteria:
Context Precision: Measures how often the retrieval model provides the correct context.
Gold Rank: Measures how high the correct context is ranked in the retrieved results.
Answer Quality: A score from 1 to 3, assigned by GPT-4 acting as a judge.
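To make the retrieval metrics concrete, here is a rough sketch of how they can be read. The formulas below are one plausible interpretation, not necessarily RagMetrics' exact definitions.

```python
# Rough interpretation of the retrieval metrics (assumed formulas, not
# necessarily RagMetrics' exact definitions).

def gold_rank(retrieved_chunks: list[str], gold_context: str) -> int | None:
    """1-based position of the gold context among the retrieved chunks, or None if absent."""
    for rank, chunk in enumerate(retrieved_chunks, start=1):
        if gold_context in chunk:
            return rank
    return None

def context_precision(gold_ranks: list[int | None]) -> float:
    """Fraction of questions for which the gold context was retrieved at all."""
    return sum(rank is not None for rank in gold_ranks) / len(gold_ranks)

# Example: the gold context appears in the first retrieved chunk.
retrieved = ["...a naturally sheltered harbor at the mouth of the Hudson River...", "...another chunk..."]
print(gold_rank(retrieved, "sheltered harbor"))  # 1

# Example: 10 questions, the gold context was retrieved for 9 of them.
ranks = [1, 1, 2, 1, 3, 1, None, 1, 2, 1]
print(context_precision(ranks))  # 0.9
```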
Run the evaluation. RagMetrics will:
Generate answers to each of the 10 questions in your dataset.
Compare the generated answers to the gold answers.
Assess the context retrieval to see if the correct contexts were injected into the prompt.
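The answer comparison is an LLM-as-judge step. The sketch below shows the general pattern using the OpenAI SDK; the judge prompt and score parsing are assumptions, not RagMetrics' actual implementation.

```python
# Sketch of LLM-as-judge answer scoring on a 1-3 scale. The prompt and the
# score parsing are assumptions, not RagMetrics' actual judge.
from openai import OpenAI

client = OpenAI()

def judge_answer(question: str, gold_answer: str, generated_answer: str) -> int:
    prompt = (
        "Rate the generated answer against the ground truth on a scale of 1 to 3, "
        "where 3 is fully correct. Reply with a single digit.\n\n"
        f"Question: {question}\n"
        f"Ground truth: {gold_answer}\n"
        f"Generated answer: {generated_answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.choices[0].message.content.strip()[0])
```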
Step 5: Analyze the Results
Once the evaluation is complete, you’ll see an overview of the results.
The results will show:
Context precision and gold rank scores, indicating how well the system retrieved the correct context.
Answer quality score, indicating how well the model answered the questions.
The cost of the evaluation.
Downloadable details for a question-by-question breakdown.
In this example, the answer quality score is 2.4 out of 3.
Step 6: Improve Your Model (Optional)
If you’re not satisfied with the results, you can make changes and re-evaluate.
For example, you could switch to a different model. Let’s switch from GPT-3.5 Turbo to GPT-4.
You can then re-run the evaluation and see if there is any improvement.
LLM configurations
After the new evaluation, the context precision and gold rank remain the same, but the answer quality increases from 2.4 to 2.7.
Key Takeaways
RagMetrics provides an easy way to test your LLM applications using labeled data.
The platform automates the evaluation process and provides detailed insights that help you make data-driven decisions to improve your application.
You can test different models and prompts.
You can evaluate the quality of context retrieval.
Next Steps
Try other evaluation criteria.
Try testing your own data and prompts.
Book a demo with us to learn more.
This tutorial should provide you with a solid starting point for using RagMetrics to improve the quality of your LLM applications.
Monitor your LLM application
After you deploy your LLM application to production, use RagMetrics to monitor how your LLM pipeline is responding to your users:
Step 1: Log traces
First, add two lines to your Python code to log traces to RagMetrics. See Getting Started: Log a trace in the API documentation.
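As a minimal sketch, those two lines look roughly like the snippet below. The ragmetrics function names are assumptions based on the pattern in Getting Started: Log a trace; treat the API documentation as the authoritative reference.

```python
# Minimal sketch: wrap your existing LLM client so every call is logged as a
# trace. The ragmetrics function names are assumptions; verify them against
# "Getting Started: Log a trace" in the API documentation.
import ragmetrics
from openai import OpenAI

ragmetrics.login(key="YOUR_RAGMETRICS_KEY")  # line 1: authenticate
client = OpenAI()
ragmetrics.monitor(client)                   # line 2: log this client's calls as traces

# From here on, calls made through `client` show up as traces in RagMetrics.
```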
Step 2: Create a review queue
In RagMetrics, a review queue is used for evaluating incoming traces both manually (human review) and automatically by LLM judges (online evaluation).
Navigate to “Reviews”
Click “Create Queue”
Fill out the form. Here are the key fields:
Condition: A search string. Incoming traces that include this string in their input, output, or metadata will be included in this review queue (a sketch of this matching rule appears after these steps).
Criteria: Traces included in this queue will be evaluated according to the criteria you select here.
Judge Model: The LLM that will judge the traces.
Dataset: For any trace in the queue, you can fix the actual output to the expected output and store it in a dataset. This dataset can then be used as a regression test, to evaluate your pipeline as you add new features.
Click “Save”
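For clarity, the Condition behaves like a substring match across a trace’s input, output, and metadata. The sketch below illustrates that matching rule only; it is not RagMetrics’ internal code.

```python
# Illustration of the Condition matching rule: a trace joins the queue if the
# search string appears in its input, output, or metadata.
# (Conceptual sketch only, not RagMetrics' internal implementation.)
def matches_condition(trace: dict, condition: str) -> bool:
    fields = [trace.get("input", ""), trace.get("output", ""), trace.get("metadata", {})]
    return any(condition in str(field) for field in fields)

trace = {
    "input": "Plan a 3-day NYC itinerary",
    "output": "Day 1: Central Park...",
    "metadata": {"app": "tour-guide"},
}
print(matches_condition(trace, "NYC"))  # True, so the trace enters this review queue
```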
Step 3: Review a trace
Now that you have created a review queue, all traces that match the condition will be reviewed automatically, according to the criteria you selected. Let’s watch the queue in action:
Log a new trace that matches the search condition.
Click on the queue. You should now see the trace there.
Click on the trace
This is the trace review panel. Human and automatic reviews are shown in the “Scores” panel. The trace is shown in the “Input” and “Output” panels below.
As soon as the trace is logged, it will be automatically evaluated according to the criteria selected in the review queue. If the trace matches more than one queue, all criteria from all queues will be applied. Automated reviews can take a few minutes to complete. As soon as they are complete, the scores will be shown in the “Scores” panel.
Click “Review” to add a (human) review to the trace.
You can review the trace overall (pass/fail) or give feedback on one of the automated scores. This feedback helps the LLM judge improve its automated judgments over time.
Step 4: Set Expected Output
On the trace review page, click “Expected Output” to correct the actual output into the expected output and store it in a dataset. This dataset can then be used as a regression test to evaluate your pipeline as you add new features. The dataset selector is at the bottom left of the Expected Output modal; it defaults to the dataset you selected when you created the review queue.
Note: You will be able to edit the expected output further on the dataset page.
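Conceptually, that dataset of corrected outputs works as a regression test: re-run your pipeline on each stored input and compare against the expected output. The sketch below shows the idea in plain Python and does not use the RagMetrics API.

```python
# Conceptual regression test over (input, expected_output) pairs.
# `run_pipeline` stands in for your own LLM application; this sketch does not
# use the RagMetrics API.
def run_regression(dataset: list[dict], run_pipeline) -> float:
    passed = sum(
        run_pipeline(row["input"]).strip() == row["expected_output"].strip()
        for row in dataset
    )
    return passed / len(dataset)  # fraction of stored cases that still pass

# Example usage with a stand-in pipeline:
dataset = [{"input": "What is 2 * 3?", "expected_output": "6"}]
print(run_regression(dataset, lambda prompt: "6"))  # 1.0
```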
Fix hallucinations
RagMetrics can help you discover and fix hallucinations in your LLM application. We define hallucinations as inconsistencies between the input and output. This feature works best for data extraction and parsing use cases, where the output is generated by parsing the input. To use this feature:
Create a review queue
In the criteria section, expand “Generation Criteria”
Select “Hallucination Detector”
For the Judge Model, we recommend a large reasoning model, such as OpenAI’s o3-mini or DeepSeek’s deepseek-reasoner (make sure you have configured your OpenAI or DeepSeek key in the “Keys” section)
Save the review queue
Log a trace to the queue
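For example, you might log an arithmetic trace whose output contradicts its input. The operands below are an assumption chosen so that the correct product matches the corrected value discussed in the example that follows (1234 * 5678 = 7,006,652); the actual example’s inputs may differ.

```python
# Illustrative trace for the hallucination detector. The operands are an
# assumption chosen so the correct product (1234 * 5678 = 7,006,652) matches
# the corrected value discussed below; the real example's inputs may differ.
trace = {
    "input": "What is 1234 * 5678?",
    "output": "1234 * 5678 = 7,056,652",  # hallucinated result, off by 50,000
    "metadata": {"task": "multiplication"},
}
assert 1234 * 5678 == 7_006_652
```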
The hallucination detector shows a list of possible hallucinations. Click on a hallucination to see the relevant highlights from the input and output. Here’s an example:
In this case, we can click “Expected Output” and fix the result from 7,056,652 to 7,006,652. We can then store this in the “multiplication” dataset which can be used to test whether our LLM app is good at multiplication.