- Harsh Vajpayee and Lolika Padmanabhan
- October 23, 2024
Harnessing the Power of LLM Evaluation with RAGAS: A Comprehensive Guide
Large Language Models (LLMs) like GPT-4, Llama-2, Llama-3, Mistral, and Phi have changed how we handle, understand, and process language. However, evaluating these models is crucial to ensure they perform well and remain trustworthy across different applications. The RAGAS framework is an emerging method for evaluating such models. This blog explores how RAGAS works and how it can be used to assess LLMs.
RAGAS, or RAG Assessment, is a framework designed to evaluate Retrieval Augmented Generation (RAG) pipelines, that is, applications that use Large Language Models (LLMs) enhanced by external data to improve context and relevance. While there are various tools available to build these RAG pipelines, assessing their performance can be challenging. RAGAS addresses this challenge by providing a suite of tools based on the latest research, enabling users to evaluate and quantify the effectiveness of LLM-generated text within their pipelines.
RAGAS not only helps in assessing the quality of the output but can also be integrated into Continuous Integration and Continuous Deployment (CI/CD) workflows, allowing for continuous monitoring and performance checks. This integration ensures that the RAG pipeline maintains its quality and effectiveness over time, providing valuable insights and helping to optimize the system’s performance.
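One common pattern for such a CI/CD check is to gate the build on the metric scores. The sketch below is a minimal pytest example under the assumption that a helper run_ragas_eval() (hypothetical, shown only for illustration) runs the RAGAS evaluation for your pipeline and returns a dict of metric scores; the thresholds are placeholders you would tune to your own baseline.
```python
# test_rag_quality.py -- a hypothetical CI gate built around a RAGAS run.
# `run_ragas_eval` is a stand-in for your own wrapper that builds the
# evaluation dataset and returns a dict of metric name -> score.
import pytest

from my_rag_pipeline.eval import run_ragas_eval  # hypothetical module

# Example thresholds; tune these to your own baseline.
THRESHOLDS = {"faithfulness": 0.80, "answer_relevancy": 0.75}


@pytest.fixture(scope="session")
def ragas_scores():
    # Run the (potentially expensive) evaluation once per CI job.
    return run_ragas_eval()


@pytest.mark.parametrize("metric,minimum", sorted(THRESHOLDS.items()))
def test_metric_meets_threshold(ragas_scores, metric, minimum):
    assert ragas_scores[metric] >= minimum, (
        f"{metric} dropped to {ragas_scores[metric]:.2f}, below the {minimum:.2f} floor"
    )
```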
RAGAS Evaluation Metrics
To compute its metrics, RAGAS works with four pieces of data for each evaluation sample (these are assembled into an evaluation dataset, as sketched after the list below):
1. Question
- The query or prompt provided to the chatbot.
- Serves as the initial input for the system.
2. Context
- Relevant text passages extracted from the knowledge base.
- Provide information to support the answer generation process.
- Crucial for understanding the chatbot’s ability to retrieve relevant information.
3. Answer
- The generated response from the chatbot.
- Reflects the system’s ability to process information from the contexts and questions.
4. Ground Truth Answer
- The correct or expected response to the question.
- Serves as the benchmark for evaluating the generated answer’s accuracy.
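A minimal sketch of assembling these four fields and scoring them, assuming the ragas 0.1.x Python API (column names question, contexts, answer, and ground_truth) together with the Hugging Face datasets package; a judge LLM (for example an OpenAI model via OPENAI_API_KEY) must be configured at run time:
```python
from datasets import Dataset  # Hugging Face datasets
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# One evaluation sample: the four inputs described above.
samples = {
    "question": ["Where and when was C.V. Raman born?"],
    "contexts": [[
        "C.V. Raman (born 7 November 1888) was an Indian physicist, known for "
        "his groundbreaking work on the scattering of light."
    ]],
    "answer": ["Raman was born in India on 7th November 1888."],
    "ground_truth": ["C.V. Raman was born on 7 November 1888 in India."],
}

dataset = Dataset.from_dict(samples)

# Requires a judge LLM to be configured (e.g. OPENAI_API_KEY in the environment).
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # e.g. {'faithfulness': 1.0, 'answer_relevancy': 0.97, ...}
```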
Components of RAGAS metrics
1. Faithfulness
This metric evaluates the factual consistency of a generated answer against a given context. It is calculated by comparing the claims made in the answer to those inferable from the context. The score is scaled between 0 and 1, with higher values indicating greater faithfulness.
Calculation
A generated answer is considered faithful if all its claims can be supported by the provided context. To compute the faithfulness score:
- Identify claims: Extract a set of claims from the generated answer.
- Cross-check: Verify each claim against the context to determine if it can be inferred.
- Calculate score: Use the following formula:
Faithfulness score = Number of supported claims / Total number of claims
Example [1,2]
Question: Where and when was C.V. Raman born?
Context: C.V. Raman (born 7 November 1888) was an Indian physicist, known for his groundbreaking work on the scattering of light and for the discovery of the Raman effect.
High faithfulness answer: Raman was born in India on 7th November 1888.
- Both claims (born in India and born on 7th November 1888) can be inferred from the context.
- Faithfulness score = 2/2 = 1.0
Low faithfulness answer: Raman was born in India on 10th November 1888.
- Only the claim “born in India” is supported by the context.
- Faithfulness score = 1/2 = 0.5
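As a toy illustration of the scoring step (in RAGAS, claim extraction and claim verification are themselves performed by an LLM judge):
```python
def faithfulness_score(claims_supported: list[bool]) -> float:
    """Fraction of the answer's claims that can be inferred from the context."""
    if not claims_supported:
        return 0.0
    return sum(claims_supported) / len(claims_supported)

# Low-faithfulness answer above: "born in India" is supported, "10th November" is not.
print(faithfulness_score([True, False]))  # 0.5
```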
2. Answer Relevance
The Answer Relevance metric assesses how closely a generated answer aligns with the original prompt. Higher scores indicate better relevance, while lower scores suggest incompleteness or redundancy.
Calculation
To compute Answer Relevance, the following steps are involved:
- Generate Artificial Questions: Create multiple questions based on the generated answer. The default number is 3.
- Embed Questions: Convert the original question and the generated questions into numerical representations (embeddings).
- Calculate Cosine Similarity: Determine the cosine similarity between the embedding of the original question and each generated question embedding.
- Compute Mean Cosine Similarity: Calculate the average cosine similarity of all generated questions to the original question. This value is the Answer Relevance score.
Formula:
answer_relevance = (1/N) * Σ cos(Egi, Eo)
Where:
- N is the number of generated questions (default: 3)
- Egi is the embedding of the i-th generated question
- Eo is the embedding of the original question
Note: While the cosine similarity ranges from -1 to 1, in practice, the Answer Relevance score typically falls between 0 and 1.
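A minimal sketch of the scoring step, assuming the artificial questions have already been generated (RAGAS does this with an LLM) and using the sentence-transformers package purely as an example embedding model:
```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

original = "Where and when was C.V. Raman born?"
generated = [  # questions generated back from the answer, normally by an LLM
    "When was C.V. Raman born?",
    "In which country was C.V. Raman born?",
    "What is C.V. Raman's date of birth?",
]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

e_o = model.encode(original)
answer_relevance = np.mean([cosine(model.encode(q), e_o) for q in generated])
print(round(float(answer_relevance), 3))
```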
Interpretation
- Higher scores indicate a strong alignment between the generated answer and the original question.
- Lower scores suggest the answer might be incomplete, irrelevant, or redundant.
By evaluating Answer Relevance, we can assess the quality and appropriateness of the generated response.
3. Context Precision
Context Precision measures the accuracy of ranking relevant information within a set of contexts. It evaluates how effectively the most pertinent information is placed at the top of the ranked list.
Calculation
The Context Precision metric is calculated as follows:
Context Precision@K = Σ(Precision@k * vk) / Total number of relevant items in the top K results
Where:
- K is the total number of chunks in the contexts.
- Precision@k is the proportion of relevant items among the top k ranked items.
- vk is a binary indicator (0 or 1) denoting whether the item at rank k is relevant.
Precision@k is calculated as:
Precision@k = true positives@k / (true positives@k + false positives@k)
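A small sketch of this computation, taking a binary relevance judgement for each retrieved chunk in ranked order (in RAGAS the judgement itself comes from an LLM comparing each chunk against the question and ground truth):
```python
def context_precision_at_k(relevance: list[int]) -> float:
    """relevance[i] is 1 if the chunk at rank i + 1 is relevant, else 0."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    for k, v_k in enumerate(relevance, start=1):
        precision_at_k = sum(relevance[:k]) / k
        score += precision_at_k * v_k
    return score / total_relevant

# Relevant chunk ranked first versus ranked second:
print(context_precision_at_k([1, 0, 0]))  # 1.0
print(context_precision_at_k([0, 1, 0]))  # 0.5
```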
Interpretation
- Higher scores indicate better context precision, meaning relevant information is consistently ranked higher.
- Lower scores suggest that relevant information is scattered or ranked lower than less relevant content.
Example [1,2]
Question: Where is India and what is its capital?
Ground Truth: India is in South Asia and its capital is New Delhi.
High Context Precision:
- Both relevant pieces of information (“India is in South Asia” and “its capital is New Delhi”) are present in the first context.
Low Context Precision:
- One relevant piece of information (“India is in South Asia”) is in the second context, while the other (“its capital is New Delhi”) is in the first.
By prioritizing relevant information at the top of the ranked contexts, Context Precision helps improve the overall performance of systems that rely on context for information retrieval or generation.
4. Context Recall
Context Recall measures the extent to which information from the retrieved context aligns with the ground truth answer. A higher score indicates better alignment.
Calculation
To calculate context recall:
- Identify claims: Extract claims from the ground truth answer.
- Check attribution: Determine if each claim can be supported by the retrieved context.
- Calculate recall: Use the following formula:
context recall = |GT claims attributed to context| / |Total number of claims in GT|
Interpretation
- A score closer to 1 indicates strong alignment between the retrieved context and the ground truth answer.
- A lower score suggests missing information in the retrieved context.
Example [1,2]
Question: Where is India and what is its capital?
Ground Truth: India is in South Asia and its capital is New Delhi.
- High context recall: The retrieved context mentions both “India is in South Asia” and “New Delhi, its capital”.
- Low context recall: The retrieved context only mentions “India is in South Asia” but not its capital.
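Numerically, the low-recall case above works out as follows (claim extraction and attribution are done by an LLM in RAGAS):
```python
# Ground-truth claims: "India is in South Asia" (supported), "its capital is New Delhi" (missing).
gt_claims_supported = [True, False]
context_recall = sum(gt_claims_supported) / len(gt_claims_supported)
print(context_recall)  # 0.5
```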
By evaluating context recall, we can assess the quality and comprehensiveness of the retrieved context in relation to the desired information.
5. Context Entities Recall
Context Entities Recall measures how effectively a retrieved context covers the entities present in the ground truth. It’s particularly useful in fact-based domains like tourism or historical question answering, where entities are crucial.
Calculation
The formula for context entities recall is:
context entity recall = |CE ∩ GE| / |GE|
Where:
- CE is the set of entities present in the retrieved context.
- GE is the set of entities present in the ground truth.
Essentially, it calculates the ratio of entities found in both the context and the ground truth to the total number of entities in the ground truth.
Interpretation
- A higher score indicates better entity coverage in the retrieved context.
- A lower score suggests that the context is missing important entities from the ground truth.
Example [1,2]
Ground Truth: The Qutub Minar is a red sandstone and marble minaret located in Delhi, India. It was commissioned by Qutb-ud-din Aibak in 1192 to celebrate his victory over the last Hindu kingdom of Delhi.
Entities in Ground Truth (GE): Qutub Minar, red sandstone, marble, Delhi, Qutb-ud-din Aibak
High Entity Recall Context: The Qutub Minar is a remarkable architectural structure in Delhi, India, built by Qutb-ud-din Aibak in 1192. This towering minaret is made of red sandstone and marble, showcasing intricate carvings and inscriptions from the Islamic era.
Low Entity Recall Context: The Qutub Minar is a famous historical monument in India. It stands tall as a UNESCO World Heritage Site and is visited by numerous tourists each year for its stunning architecture and historical significance.
In the high entity recall context, most entities from the ground truth are included, while in the low entity recall context, key entities like red sandstone, marble, and Qutb-ud-din Aibak are missing.
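Treating the scoring step in isolation (entity extraction is done by an LLM in RAGAS), the low-recall case above reduces to a set intersection:
```python
ge = {"Qutub Minar", "red sandstone", "marble", "Delhi", "Qutb-ud-din Aibak"}
ce_low = {"Qutub Minar", "India", "UNESCO World Heritage Site"}  # entities in the low-recall context

entity_recall = len(ce_low & ge) / len(ge)
print(entity_recall)  # 0.2 -- only "Qutub Minar" overlaps with the ground truth
```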
By evaluating context entities recall, you can assess how well the retrieved context aligns with the entity-based information in the ground truth.
6. Answer Semantic Similarity
Answer Semantic Similarity evaluates how closely a generated answer matches the ground truth in terms of meaning. This metric goes beyond simple lexical overlap to capture the underlying semantic relationship between the two texts.
Calculation
Typically, a cross-encoder model is used to compute the semantic similarity score. This model takes the ground truth and the generated answer as input and produces a similarity score between 0 and 1. A higher score indicates greater semantic similarity.
Interpretation
- Higher scores indicate that the generated answer conveys a similar meaning to the ground truth.
- Lower scores suggest that the generated answer is semantically different from the ground truth.
Example [1,2]
Ground Truth: C.V. Raman’s discovery of the Raman effect revolutionized our understanding of light scattering.
- High similarity answer: Raman’s groundbreaking discovery of the Raman effect transformed our comprehension of light scattering
- Low similarity answer: Jagadish Chandra Bose’s work on radio waves significantly advanced early wireless communication research.
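A minimal sketch of scoring these two answers with the sentence-transformers CrossEncoder class and a publicly available STS model; which model RAGAS uses internally depends on its version and configuration, so treat this as illustrative:
```python
from sentence_transformers import CrossEncoder

# STS cross-encoders score a pair of texts for semantic similarity.
model = CrossEncoder("cross-encoder/stsb-roberta-base")

ground_truth = ("C.V. Raman's discovery of the Raman effect revolutionized "
                "our understanding of light scattering.")
high = ("Raman's groundbreaking discovery of the Raman effect transformed "
        "our comprehension of light scattering.")
low = ("Jagadish Chandra Bose's work on radio waves significantly advanced "
       "early wireless communication research.")

scores = model.predict([(ground_truth, high), (ground_truth, low)])
print(scores)  # the first pair should score far higher than the second
```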
By measuring answer semantic similarity, we can assess the quality of generated answers in a more nuanced way than traditional metrics like exact match or F1-score.
7. Answer Correctness
Answer Correctness is a metric that assesses the accuracy of a generated answer compared to the ground truth. It combines both semantic and factual similarity for a comprehensive evaluation.
Key Components
- Semantic Similarity: Measures how closely the meaning of the generated answer aligns with the ground truth.
- Factual Similarity: Evaluates the accuracy of the factual information presented in the answer.
Calculation
- Calculate Semantic Similarity: Use a cross-encoder model or other techniques to determine the semantic overlap between the answer and ground truth.
- Calculate Factual Similarity: Employ fact-checking or knowledge base comparison to assess the accuracy of the factual claims.
- Combine Scores: Assign weights to semantic and factual similarity based on the specific use case. Combine the weighted scores to obtain the final answer correctness score.
- Apply Threshold (Optional): If desired, round the final score to a binary value (0 or 1) using a predefined threshold.
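A toy sketch of the combination step, assuming the factual score (for example, an F1 over claim-level true/false positives) and the semantic similarity score have already been computed; the 0.75/0.25 weighting is only an example:
```python
def answer_correctness(factual_score: float, semantic_score: float,
                       weights: tuple[float, float] = (0.75, 0.25)) -> float:
    """Weighted combination of factual and semantic similarity, both in [0, 1]."""
    w_factual, w_semantic = weights
    return w_factual * factual_score + w_semantic * semantic_score

# Factually wrong city ("Kochi") but semantically close phrasing:
print(answer_correctness(factual_score=0.5, semantic_score=0.9))  # 0.6
```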
Example [1,2]
Ground Truth: Ignitarium was established in 2012 in Bangalore.
- High answer correctness: In 2012, Ignitarium was established in Bangalore. (Both semantic and factual similarity are high)
- Low answer correctness: Ignitarium was established in Kochi in 2012. (Semantic similarity might be reasonable, but factual similarity is low)
Additional Considerations
- Complex Answers: For answers with multiple components, consider breaking them down into smaller units for evaluation.
- Domain-Specific Knowledge: Incorporate domain-specific knowledge bases or ontologies to enhance factual accuracy assessment.
- Human Evaluation: Human judgment can be valuable for complex or ambiguous cases.
By combining semantic and factual similarity, answer correctness provides a more robust evaluation of generated text compared to traditional metrics like exact match or F1-score.
8. Aspect Critique
Aspect Critique is a binary classification method used to evaluate text submissions based on predefined aspects or user-defined criteria.
Key Features
- Predefined Aspects: Offers a set of standard aspects like harmfulness, correctness, coherence, conciseness, etc.
- Custom Aspects: Allows users to define specific criteria tailored to their needs.
- Binary Output: Provides a simple yes/no answer for each aspect.
- Strictness Parameter: Controls the sensitivity of the evaluation.
How it Works
- Define Aspects: Select from predefined aspects or create custom ones.
- Evaluate Submission: Analyze the text against each defined aspect.
- Assign Binary Score: Determine if the submission aligns with the aspect (1) or not (0).
Example
Predefined Aspects: harmfulness, correctness
Submission: “The sky is blue, and the grass is green.”
- Harmfulness: 0 (not harmful)
- Correctness: 1 (correct)
Strictness Parameter
The strictness parameter influences the model’s decision-making. A higher strictness value makes the model more conservative in assigning positive scores.
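As an illustration of the idea (not of the ragas implementation itself), the sketch below repeats a yes/no judgement strictness times and returns 1 only when every run agrees, so larger values make a positive score harder to obtain; ask_llm_yes_no is a hypothetical stand-in for a call to your judge LLM:
```python
from typing import Callable

def aspect_critique(submission: str, aspect_definition: str,
                    ask_llm_yes_no: Callable[[str, str], bool],
                    strictness: int = 1) -> int:
    """Binary aspect check: return 1 only if every one of the `strictness`
    judge runs answers 'yes' for this aspect."""
    votes = [ask_llm_yes_no(submission, aspect_definition) for _ in range(strictness)]
    return int(all(votes))

# Hypothetical judge stand-in; in practice this would prompt an LLM.
def always_yes(submission: str, definition: str) -> bool:
    return True

print(aspect_critique("The sky is blue, and the grass is green.",
                      "Is the submission free of harmful content?",
                      always_yes, strictness=3))  # 1
```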
Applications
Aspect Critique can be used for:
- Content moderation: Identifying harmful or inappropriate content.
- Quality assurance: Evaluating the correctness and coherence of text.
- Text summarization: Assessing the conciseness of summaries.
- Custom evaluation: Creating specific criteria for different use cases.
By providing a clear and binary evaluation, Aspect Critique offers a straightforward way to assess text submissions based on predefined or custom criteria.
Conclusion
In conclusion, the RAGAS framework provides a valuable tool for evaluating Retrieval Augmented Generation (RAG) pipelines. Unlike traditional metrics that focus primarily on quantitative aspects like n-gram overlap and prediction accuracy, RAGAS offers a more comprehensive and nuanced assessment. It addresses various dimensions of performance, which may include relevance, accuracy, and other qualitative aspects, depending on the specific implementation. This broader approach helps capture the full range of a model’s capabilities and ensures that it not only generates correct information but also delivers it in a coherent, contextually appropriate, and user-friendly manner. By using RAGAS, developers and researchers can gain deeper insights into their models’ strengths and weaknesses, leading to better optimization and more effective real-world applications.