Automated Testing
The test_bco_rag.py script contains a suite of tests designed to evaluate the functionality of the BcoRag tool using the pytest framework and the open-source LLM evaluation framework DeepEval.
Test Cases
There is one test case for each domain:
test_usability
test_io
test_description
test_execution
test_parametric
test_error
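A minimal sketch of what one of these tests might look like with pytest and DeepEval (the prompt, output, and threshold values below are illustrative assumptions, not the repository's actual test fixtures):

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

def test_usability():
    # Hypothetical values; the real test populates these from a BcoRag run
    # (the input prompt, the generated usability domain, and the content
    # returned by the retrieval step).
    test_case = LLMTestCase(
        input="Generate the usability domain for the attached paper.",
        actual_output='{"usability_domain": ["..."]}',
        retrieval_context=["...chunk retrieved from the source paper..."],
    )
    # Assert that both metrics score above an illustrative threshold.
    assert_test(
        test_case,
        [AnswerRelevancyMetric(threshold=0.5), FaithfulnessMetric(threshold=0.5)],
    )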
Test Metrics
The test suite evaluates two different metrics:
Answer Relevancy:
The answer relevancy metric is used to evaluate how relevant the finalized generated output (in our case, the generated domain) is to the original input prompt. It attempts to evaluate relevancy (does the generated content directly relate to the question at hand?), appropriateness (is the content appropriate given the context of the input?), and focus (does the content stay on topic?).
"The answer relevancy metric measures the quality of your RAG pipeline's generator by evaluating how relevant the actual_output of your LLM application is compared to the provided input."
- Source: Answer Relevancy
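As a rough illustration, this metric can also be run standalone on a single test case (the threshold, model choice, and example values below are assumptions; the test suite itself invokes the metrics through deepeval test run):

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Answer relevancy only needs the input prompt and the generated output.
metric = AnswerRelevancyMetric(threshold=0.5, model="gpt-4o")
test_case = LLMTestCase(
    input="Generate the usability domain for the attached paper.",
    actual_output='{"usability_domain": ["..."]}',
)
metric.measure(test_case)
print(metric.score, metric.reason)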
Faithfulness:
The faithfulness metric assesses how accurate and truthful the finalized generated output (in our case, the generated domain) is with respect to the source material (the retrieved content). It attempts to ensure that the content is relevant, factual, and does not contradict the information gathered during the retrieval step.
"The faithfulness metric measures the quality of your RAG pipeline's generator by evaluating whether the actual_output factually aligns with the contents of your retrieval_context."
- Source: Faithfulness
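A similar standalone sketch for faithfulness; unlike answer relevancy, this metric also requires the retrieval context, since the generated output is checked against the retrieved content (the values below are again illustrative assumptions):

from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Faithfulness compares the generated output against the retrieved context.
metric = FaithfulnessMetric(threshold=0.5, model="gpt-4o")
test_case = LLMTestCase(
    input="Generate the usability domain for the attached paper.",
    actual_output='{"usability_domain": ["..."]}',
    retrieval_context=["...chunk retrieved from the source paper..."],
)
metric.measure(test_case)
print(metric.score, metric.reason)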
Running The Tests
It is not recommended to run all the tests at once. The test suite uses gpt-4o in the backend to evaluate the above metrics.
To run one test at a time:
deepeval test run test_bco_rag.py::test_{domain}
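For example, to evaluate only the usability domain:
deepeval test run test_bco_rag.py::test_usability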
To run all the tests at once:
deepeval test run test_bco_rag.py