Building trust through more reliable AI

RagMetrics is an AI evaluation platform designed for Retrieval-Augmented Generation (RAG) systems. As enterprises adopt large language models (LLMs) for AI assistants and semantic search tools, ensuring reliable outputs is crucial. Our platform assesses retrieval relevance and generation accuracy, helping teams identify inaccuracies, conduct A/B tests, and automate quality assurance. RagMetrics is the first end-to-end observability solution for text-based generative AI, giving teams the evidence they need to trust what they ship.

AI evaluation platform for RAG systems
Evaluate AI with RagMetrics
Built by a team of technology experts
Building trust in AI products

Leadership

RagMetrics was founded by seasoned technologists from Google, Microsoft, Chyron, and Cloudflare, leaders with decades of hands-on experience in machine learning and responsible AI. Together, they’re building the foundation for trustworthy, enterprise-grade generative AI.

The RagMetrics team brings deep expertise in evaluating LLMs and solving real-world challenges like hallucinations, retrieval drift, and prompt fragility. That experience drives RagMetrics’ mission: to help organizations deploy production-ready GenAI solutions.

Olivier Cohen
CEO
LinkedIn
Hernan Lardiez
COO
LinkedIn
Mike Moreno
CMO
LinkedIn

Investor Opportunities

RagMetrics is currently self-funded and actively seeking early-stage investment to scale product development and accelerate enterprise traction.

Stage & Funding

Pre-seed / Seed. The founders have invested initial resources to build a robust MVP and validate core use cases.

Clear Market Signal

We address a fast-growing need for AI evaluation and QA tools, especially for RAG-based LLM systems.

Founding Team

Deep domain expertise from Google, Microsoft, Chyron, and Cloudflare, demonstrating strong founder-market fit and technical execution capability.

Interested investors are invited to connect for more details on roadmap, traction, and projected milestones.

Contact Now

Frequently Asked Questions

Have another question? Please contact our team!

Can I compare multiple LLMs with RagMetrics?
Yes. RagMetrics was built for benchmarking large language models. You can run identical tasks across multiple LLMs, compare their outputs side by side, and score them for reasoning quality, hallucination risk, citation reliability, and output robustness.
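
As an illustration of the side-by-side workflow, the sketch below runs one task against several models and tabulates per-criterion scores. The run_model and score_output helpers are hypothetical stand-ins for your LLM calls and RagMetrics scoring, not part of any published SDK.

```python
# Illustrative only: run_model() and score_output() are hypothetical
# stand-ins for your LLM provider calls and RagMetrics scoring.
TASK = "Summarize the attached policy document in two sentences."
MODELS = ["gpt-4", "claude-3-opus", "gemini-1.5-pro"]
CRITERIA = ["reasoning_quality", "hallucination_risk", "citation_reliability"]

def run_model(model: str, task: str) -> str:
    """Call your LLM provider here; stubbed for illustration."""
    return f"[{model} output for: {task}]"

def score_output(output: str, criterion: str) -> float:
    """Call RagMetrics scoring here; stubbed for illustration."""
    return 0.0

# Run the identical task through every model, then compare side by side.
results = {
    model: {c: score_output(run_model(model, TASK), c) for c in CRITERIA}
    for model in MODELS
}
for model, scores in results.items():
    print(model, scores)
```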

Does RagMetrics offer an API?
Yes. RagMetrics provides a powerful API for programmatically scoring and comparing LLM outputs. Use it to integrate hallucination detection, prompt testing, and model benchmarking directly into your GenAI pipeline.
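
As a rough sketch of what such an integration could look like, the snippet below posts a single output for scoring over HTTP. The URL, payload fields, and response shape are illustrative placeholders; consult the RagMetrics API documentation for the actual contract.

```python
# Illustrative only: the URL, payload fields, and response shape are
# placeholders, not the documented RagMetrics API.
import os
import requests

payload = {
    "task": "customer-support-answering",
    "prompt": "What is your refund policy?",
    "model_output": "Refunds are available within 30 days of purchase.",
    "retrieved_context": ["Refunds: 30-day window from date of purchase."],
    "criteria": ["hallucination", "retrieval_accuracy"],
}

resp = requests.post(
    "https://api.ragmetrics.example/v1/score",  # placeholder URL
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['RAGMETRICS_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # per-criterion scores for this output
```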

How can RagMetrics be deployed?
RagMetrics can be deployed in multiple ways: as a fully managed SaaS solution, inside your private cloud environment (such as AWS, Azure, or GCP), or on-premises for organizations that require maximum control and compliance.

How do I run an experiment?
Running an experiment is simple. You connect your LLM or retrieval-augmented generation (RAG) pipeline, such as Claude, GPT-4, Gemini, or your own model, then define the task you're solving, upload a labeled dataset or test prompts, and select scoring criteria like hallucination rate or retrieval accuracy. Finally, you run the experiment through the dashboard or API.
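
A minimal sketch of those steps, expressed as one request, might look like the following. The endpoint and field names are hypothetical placeholders that simply mirror the steps above; they are not the documented RagMetrics schema.

```python
# Illustrative only: the endpoint and field names are placeholders that
# mirror the experiment steps described above.
import os
import requests

experiment = {
    # 1. Connect your LLM or RAG pipeline
    "pipeline": {"provider": "openai", "model": "gpt-4"},
    # 2. Define the task you are solving
    "task": "Answer customer questions from the product knowledge base.",
    # 3. Upload a labeled dataset or test prompts
    "dataset": "support_questions_labeled.jsonl",
    # 4. Select scoring criteria
    "criteria": ["hallucination_rate", "retrieval_accuracy"],
}

resp = requests.post(
    "https://api.ragmetrics.example/v1/experiments",  # placeholder URL
    json=experiment,
    headers={"Authorization": f"Bearer {os.environ['RAGMETRICS_API_KEY']}"},
    timeout=30,
)
print(resp.json())  # 5. Run, then inspect results in the dashboard or via API
```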

What do I need to run an evaluation?
To run an evaluation, you’ll need access to your LLM’s API key, the endpoint URL or model pipeline, a dataset or labeled test inputs, a clear task description, and a definition of success for that task. You can also include your own scoring criteria or subject matter expertise.
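
For reference, that checklist could be gathered into a structure like the one below before you start. The key names are hypothetical and only organize the required inputs; they do not reflect an official configuration format.

```python
# Illustrative checklist of evaluation inputs; key names are hypothetical.
evaluation_inputs = {
    "llm_api_key": "<your-provider-key>",          # your model provider key
    "endpoint": "https://api.openai.com/v1",       # or your own pipeline URL
    "dataset": "labeled_test_inputs.jsonl",        # labeled inputs or test prompts
    "task_description": "Answer billing questions using retrieved policy docs.",
    "success_definition": "Answers are grounded in retrieved context with no invented clauses.",
    "custom_criteria": ["tone_matches_brand"],     # optional: your own scoring criteria
}
print(evaluation_inputs)
```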

Which models does RagMetrics support?
RagMetrics is model-agnostic and supports any public, private, or open-source LLM. You can paste your custom endpoint, evaluate outputs from models like Mistral, Llama 3, or DeepSeek, and compare the results to popular models like GPT-4, Claude, and Gemini using the same scoring framework.
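
To illustrate, the sketch below submits a custom endpoint and a hosted model to the same scoring framework. The comparison payload and endpoint are assumptions made for this example, not the real RagMetrics API.

```python
# Illustrative only: the payload shape and endpoint are placeholders showing
# a custom endpoint and a hosted model sharing one scoring framework.
import os
import requests

comparison = {
    "models": [
        {"name": "llama-3-70b", "endpoint": "https://llm.internal.example/v1/chat"},
        {"name": "gpt-4"},  # hosted model referenced by name
    ],
    "task": "Summarize incident reports for the on-call engineer.",
    "dataset": "incident_reports_labeled.jsonl",
    "criteria": ["hallucination_rate", "reasoning_quality"],
}

resp = requests.post(
    "https://api.ragmetrics.example/v1/comparisons",  # placeholder URL
    json=comparison,
    headers={"Authorization": f"Bearer {os.environ['RAGMETRICS_API_KEY']}"},
    timeout=30,
)
print(resp.json())  # the same criteria applied to both models
```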

Validate LLM Responses and Accelerate Deployment

RagMetrics enables GenAI teams to validate agent responses, detect hallucinations, and speed up deployment through AI-powered QA and human-in-the-loop review.

Get Started