
- RAGAS Benchmark (Currently Available): A standardized test that measures model performance on specific tasks using predefined metrics and datasets.
- Human Evaluation (Coming Soon): Manual assessment by human reviewers to evaluate response quality, relevance, and appropriateness based on subjective criteria.
- Model as Judge (Currently Available): Automatically compares your customized model against a base foundation model, with ChatGPT acting as an AI judge to evaluate responses side-by-side and determine which performs better.
RAGAS Benchmark evaluation
Step 1: Choose ‘RAGAS Benchmark’ evaluation method
- Click “Create New Evaluation”
- Select RAGAS Benchmark
- Click Next

Step 2: Select the base model
- Choose the base model you want to evaluate
- Click Next
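The platform runs the RAGAS Benchmark for you, but if you want to see the kind of metrics a RAGAS-style evaluation computes, the open-source ragas Python library is a useful reference point. The sketch below is illustrative only: it assumes the ragas 0.1.x API (which uses an OpenAI key by default), the dataset rows are invented, and the platform's own pipeline may differ.

```python
# Minimal sketch of a RAGAS-style evaluation using the open-source `ragas`
# library (0.1.x API). Dataset rows are invented examples, not platform data.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

rows = {
    "question": ["What is the enterprise refund window?"],
    "answer": ["Enterprise refunds are prorated within 30 days."],
    "contexts": [["Refunds for enterprise plans are prorated within 30 days."]],
    "ground_truth": ["Enterprise refunds are prorated within 30 days of purchase."],
}
dataset = Dataset.from_dict(rows)

# Scores each row on faithfulness, answer relevancy, and context precision.
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
```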



Model as Judge evaluation
Model as Judge automatically compares your customized model against a base foundation model. ChatGPT evaluates responses side-by-side to determine which performs better.
Setting up a Model as Judge evaluation
Step 1: Choose ‘Model as Judge’ evaluation method
- Click “Create New Evaluation”
- Select Model as Judge
- Click Next

Step 2: Select your customized model
- Choose the customized model you want to evaluate
- Review the model description
- Click Next

Step 3: Review the evaluation details
- Evaluation name: Auto-generated name
- Your trained model: Your customized model (with RAG)
- Base model: Foundation model for comparison (without RAG)
- Evaluation type: Model as Judge
Using the evaluation interface

- Left panel: Your trained model (with your data)
- Right panel: Base model (without your data)
- Type your question in the input box
- Send to both models simultaneously
- Review responses in real-time
- Scroll down to view automated evaluation results (a code sketch of the side-by-side pattern follows)
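The interface handles this comparison for you; the sketch below only illustrates the underlying side-by-side pattern. It assumes the OpenAI Python SDK, and the model names, retrieved context, and judge prompt are placeholders rather than the platform's actual internals.

```python
# Illustrative sketch of the side-by-side comparison the interface automates.
# Model names, context, and judge prompt are assumptions, not platform internals.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(model: str, question: str, context: str = "") -> str:
    """Send one question to one model, optionally grounded in retrieved context."""
    messages = [{"role": "user", "content": f"{context}\n\n{question}".strip()}]
    reply = client.chat.completions.create(model=model, messages=messages)
    return reply.choices[0].message.content

question = "What is our refund policy for enterprise customers?"
retrieved = "Refunds for enterprise plans are prorated within 30 days."  # from your RAG store

response_a = ask("gpt-4o", question, context=retrieved)  # stands in for your trained model (with RAG)
response_b = ask("gpt-4o", question)                     # stands in for the base model (no RAG)

judge_prompt = (
    "You are an impartial judge. Compare the two responses to the question "
    "and state which is better and why.\n\n"
    f"Question: {question}\n\nResponse A:\n{response_a}\n\nResponse B:\n{response_b}"
)
print(ask("gpt-4o", judge_prompt))
```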
Understanding evaluation results
Each evaluation summary includes:
- Winner declaration: Shows which model provided the better response.
- Factual grounding analysis: Highlights why one response outperformed the other.
  - Response A (RAG): How well your model uses training data
  - Response B (Base): Evaluation of the unenhanced model
- Winner rationale: Detailed explanation of the judge’s decision.
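For note-taking or downstream analysis, the summary fields above map naturally onto a simple record type. The Python sketch below is a hypothetical shape; the field names are assumptions, not the platform's actual schema.

```python
from dataclasses import dataclass

# Hypothetical shape for one evaluation summary; names are illustrative only.
@dataclass
class EvaluationSummary:
    winner: str                 # e.g. "trained_model" or "base_model"
    grounding_response_a: str   # how well the RAG model used its training data
    grounding_response_b: str   # assessment of the unenhanced base model
    winner_rationale: str       # the judge's detailed explanation
```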
Evaluation criteria
The AI judge evaluates responses based on:
- Factual accuracy from source material
- Proper use of grounding and training data
- Relevance to the question
- Completeness and clarity
Grounded responses using your training data consistently outperform speculative answers.
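The platform's actual judge prompt is not exposed, but the four criteria above can be pictured as a rubric like the one sketched below. The wording is an assumption for illustration only.

```python
# Hypothetical rubric expressing the four criteria as a judge prompt.
# The platform's real prompt may be worded and weighted differently.
JUDGE_RUBRIC = """Compare Response A and Response B to the question below.
Assess each response on:
1. Factual accuracy against the provided source material
2. Proper use of grounding and training data
3. Relevance to the question
4. Completeness and clarity
Declare which response wins and explain the rationale for your decision."""
```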
Interpreting your results
- Your model wins: Your customization is working effectively. Training data is being used properly, and the model is ready for this use case.
- Base model wins: A knowledge gap has been identified. Add more training data on this topic and continue refinement.
- Mixed results: Partial success indicates you should add data for questions where your model underperformed and continue testing.
Best practices
Testing strategy (a scripted version of this loop is sketched after the list):
- Ask 10-15 diverse questions minimum
- Test scenarios where your data should provide an advantage
- Include difficult and edge cases
- Review “Past Results” to track improvement over time
- Identify patterns in wins and losses
- Add training data to address knowledge gaps
- Re-test to verify improvements
- Iterate continuously
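For teams who prefer to run the loop above as a script rather than question by question in the UI, a minimal sketch follows. The compare_models helper is hypothetical and must be wired up to your own side-by-side comparison (for example, the earlier judge sketch); the sample questions are placeholders.

```python
# Sketch of a batch testing loop: ask a set of realistic questions, tally which
# side wins, and collect the losses as candidates for new training data.
from collections import Counter

def compare_models(question: str) -> str:
    """Hypothetical helper: run a side-by-side judged comparison for one
    question and return 'trained', 'base', or 'tie'. Wire this up to your
    own comparison code (e.g. the earlier judge sketch)."""
    raise NotImplementedError

test_questions = [
    "What is the refund window for enterprise customers?",
    "How do I escalate a priority-1 support ticket?",
    # ... 10-15 diverse questions, including edge cases
]

results = Counter()
knowledge_gaps = []
for question in test_questions:
    verdict = compare_models(question)
    results[verdict] += 1
    if verdict == "base":
        knowledge_gaps.append(question)  # add training data for these topics

print(results)                      # e.g. Counter({'trained': 11, 'base': 3, 'tie': 1})
print("Add data for:", knowledge_gaps)
```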
Tips for success
- Grounded responses consistently outperform speculation
- Losses reveal where to add more training data
- Test regularly as you add new content
- Use realistic queries your actual users would ask

