Concepts

  • Experiments: Experiments in RagMetrics are A/B tests that help you make data-driven decisions. RagMetrics supports three types of experiments:

    • Compare Models: Pick which model will give you the best results.

    • Compare Prompts: Pick which prompt will give you the best results.

    • Advanced: Pick the combination of models, prompts, and other A/B test parameters that gives you the best results. This option is most effective when testing your own LLM pipeline. It lets you specify cohorts (competitor groups) declaratively in JSON; under the hood, all experiments use this format (see the illustrative sketch below).
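
To give a rough sense of the format, here is a minimal sketch of a declarative cohort specification. The field names and values are hypothetical, not the exact RagMetrics JSON schema; consult the RagMetrics documentation for the real format.

```python
import json

# Hypothetical cohort specification for an Advanced experiment.
# Field names are illustrative only, not the actual RagMetrics schema.
experiment = {
    "cohorts": [
        {"name": "GPT-4o / concise prompt", "model": "gpt-4o", "prompt": "concise_v2"},
        {"name": "GPT-4o / detailed prompt", "model": "gpt-4o", "prompt": "detailed_v1"},
    ],
    "criteria": ["Accuracy", "Succinctness"],
}

print(json.dumps(experiment, indent=2))
```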

  • Labeled Data: The foundation for evaluating your LLM applications.

    • Structure: Each example includes a user question, a correct answer, and ground truth contexts (see the sketch after this list).

    • Creation: Can be uploaded as a spreadsheet, generated from documents, or generated from web pages.

    • Purpose: Used as the baseline for evaluating generated answers.
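
As a minimal sketch, one labeled-data example might look like the row below, saved as a spreadsheet-style CSV. The column names are assumptions made for illustration, not necessarily the exact upload format RagMetrics expects.

```python
import csv

# Hypothetical labeled-data row; column names are illustrative only.
labeled_data = [
    {
        "question": "What is the refund window for annual plans?",
        "ground_truth_answer": "30 days from the purchase date.",
        "ground_truth_context": "Refund policy: annual plans may be refunded within 30 days of purchase.",
    },
]

# Write the rows as a spreadsheet-style CSV for upload.
with open("labeled_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(labeled_data[0].keys()))
    writer.writeheader()
    writer.writerows(labeled_data)
```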

  • Evaluation Criteria:

    • RagMetrics judges performance using one or more criteria you select.

    • Pre-built library: RagMetrics offers 209 pre-built criteria for common LLM tasks, including accuracy, succinctness, and context relevance.

    • Custom criteria: Every LLM application is different; you can create your own criteria to measure performance for your specific application.

    • Phase: Criteria can be applied during generation or retrieval. To use retrieval criteria (a generic pipeline sketch follows this list):

      • Your labeled dataset must have ground truth contexts.

      • You must connect RagMetrics to your own LLM pipeline.

      • That pipeline must provide retrieved contexts.
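
In generic terms, this means your pipeline should return the retrieved contexts alongside the generated answer. The sketch below is plain Python with hypothetical function and field names, not the RagMetrics integration interface.

```python
# Generic sketch of an LLM pipeline that exposes retrieved contexts with the answer.
# Function and field names are illustrative, not a RagMetrics interface.
def retrieve(question: str) -> list[str]:
    # Stand-in for your real retriever (vector search, keyword search, etc.).
    return ["Refund policy: annual plans may be refunded within 30 days of purchase."]

def generate(question: str, contexts: list[str]) -> str:
    # Stand-in for your real LLM call, grounded in the retrieved contexts.
    return "Annual plans can be refunded within 30 days of purchase."

def answer_question(question: str) -> dict:
    contexts = retrieve(question)
    return {"answer": generate(question, contexts), "retrieved_contexts": contexts}

print(answer_question("What is the refund window for annual plans?"))
```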

    • Types: Criteria can be based on an LLM judge or on deterministic functions, such as a regex match or a JSON difference (see the sketch after this list).

    • Score: Criteria return one of two types of scores:

      • Likert scale (1-5, where 5 is best)

      • Boolean (True or False).
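
For illustration, a deterministic, Boolean-scored criterion could be as simple as a regex match. The sketch below is generic Python, not the RagMetrics criterion API.

```python
import re

# Minimal sketch of a deterministic, Boolean-scored criterion (a regex match).
# Generic illustration only; not the RagMetrics criterion API.
def contains_iso_date(generated_answer: str) -> bool:
    """Return True if the answer contains a date in YYYY-MM-DD format."""
    return bool(re.search(r"\b\d{4}-\d{2}-\d{2}\b", generated_answer))

print(contains_iso_date("The contract starts on 2024-07-01."))  # True
print(contains_iso_date("The start date has not been set."))    # False
```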

  • Analytics: Review and compare experiment results.

    • View Score Breakdowns: See how each question, prompt, or model performed on the selected criteria.

    • Identify Changes: See which specific questions or prompts improved or degraded between variations.

    • Download Details: Download detailed results for further analysis (see the sketch below).

    • Compare: Use the results to make informed choices about your prompts, models, and other elements of your system.
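
If you download the detailed results, a quick comparison could look like the sketch below. The file name and column names (cohort, criterion, score) are assumptions about the export, not a documented RagMetrics format.

```python
import pandas as pd

# Hypothetical analysis of downloaded experiment results.
# File name and column names are assumptions about the export format.
results = pd.read_csv("experiment_results.csv")

# Average score per cohort and criterion, to compare prompts or models side by side.
breakdown = results.groupby(["cohort", "criterion"])["score"].mean().unstack()
print(breakdown)
```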