To evaluate your model, navigate to the ‘Evaluations’ tab and click ‘Create Evaluation’. Select your evaluation method from the available options:
  • RAGAS Benchmark (Currently Available): A standardized test that measures model performance on specific tasks using predefined metrics and datasets.
  • Human Evaluation (Coming Soon): Manual assessment by human reviewers to evaluate response quality, relevance, and appropriateness based on subjective criteria.
  • Model as Judge (Currently Available): Automatically compares your customized model against a base foundation model, with ChatGPT acting as an AI judge to evaluate responses side-by-side and determine which performs better.
Click ‘Next’ to continue.

RAGAS Benchmark evaluation

Step 1: Choose ‘RAGAS Benchmark’ evaluation method
  1. Click “Create New Evaluation”
  2. Select RAGAS Benchmark
  3. Click Next
Step 2: Select your base model
  1. Choose the base model you want to evaluate
  2. Click Next
Step 3: Review and confirm the details, then click ‘Start Evaluation’ to begin.
Monitor the evaluation status in the Evaluation Dashboard. Once complete, the status will update to “Completed” and the evaluation report will open in a separate window.
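The platform runs the benchmark for you, so no code is required. For context on what a RAGAS-style evaluation measures, here is a minimal sketch using the open-source ragas package; the dataset columns, metric selection, and evaluate() call are assumptions that vary by ragas version, and the data shown is hypothetical.

```python
# Minimal sketch of a RAGAS-style evaluation (illustrative only).
# Assumes the open-source `ragas` package; column names and the
# evaluate() signature vary between ragas versions.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Hypothetical evaluation records: a question, the retrieved contexts,
# the model's answer, and a reference (ground-truth) answer.
records = {
    "question": ["What is the refund window for annual plans?"],
    "contexts": [["Annual plans can be refunded within 30 days of purchase."]],
    "answer": ["Annual plans are refundable within 30 days of purchase."],
    "ground_truth": ["Refunds are available for 30 days after purchase."],
}

dataset = Dataset.from_dict(records)

# Each metric scores a different aspect of a RAG response:
# faithfulness      - is the answer supported by the retrieved contexts?
# answer_relevancy  - does the answer address the question?
# context_precision - were the retrieved contexts actually useful?
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)  # e.g. {'faithfulness': 0.95, 'answer_relevancy': 0.91, ...}
```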

Model as Judge evaluation

Model as Judge automatically compares your customized model against a base foundation model. ChatGPT evaluates responses side-by-side to determine which performs better.
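The comparison itself is handled inside the platform. Conceptually, an LLM-as-judge pass looks something like the sketch below, written against the OpenAI Python SDK; the judge prompt and model name are illustrative assumptions, not the platform's actual implementation.

```python
# Conceptual sketch of an LLM-as-judge comparison (not the platform's code).
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the judge prompt and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def judge(question: str, response_a: str, response_b: str) -> str:
    """Ask the judge model which response answers the question better."""
    prompt = (
        "You are an impartial judge comparing two answers to the same question.\n"
        f"Question: {question}\n\n"
        f"Response A (customized model with RAG):\n{response_a}\n\n"
        f"Response B (base model without RAG):\n{response_b}\n\n"
        "Compare the responses for factual accuracy, grounding, relevance, "
        "completeness, and clarity. Declare a winner and explain why."
    )
    completion = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```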

Setting up a Model as Judge evaluation

Step 1: Choose ‘Model as Judge’ evaluation method
  1. Click “Create New Evaluation”
  2. Select Model as Judge
  3. Click Next
Step 2: Select your trained model
  1. Choose the customized model you want to evaluate
  2. Review the model description
  3. Click Next
Step 3: Review and confirm
Verify your settings:
  • Evaluation name: Auto-generated name
  • Your trained model: Your customized model (with RAG)
  • Base model: Foundation model for comparison (without RAG)
  • Evaluation type: Model as Judge
Click Start Comparison to begin.

Using the evaluation interface

The interface displays a side-by-side chat comparison:
  • Left panel: Your trained model (with your data)
  • Right panel: Base model (without your data)
To test your models:
  1. Type your question in the input box
  2. Send it to both models simultaneously (this fan-out is sketched after the list)
  3. Review responses in real-time
  4. Scroll down to view automated evaluation results
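Step 2 fans the same question out to both models at once. The sketch below shows one way such a fan-out could work; query_trained_model and query_base_model are hypothetical placeholders for whatever client calls your deployments expose.

```python
# Hypothetical sketch of sending one question to two models in parallel.
# query_trained_model / query_base_model are placeholders, not a real API.
from concurrent.futures import ThreadPoolExecutor

def query_trained_model(question: str) -> str:
    raise NotImplementedError  # call your customized (RAG) model here

def query_base_model(question: str) -> str:
    raise NotImplementedError  # call the base foundation model here

def ask_both(question: str) -> tuple[str, str]:
    """Send the same question to both models concurrently and return both answers."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        trained = pool.submit(query_trained_model, question)
        base = pool.submit(query_base_model, question)
        return trained.result(), base.result()
```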

Understanding evaluation results

Each evaluation summary includes:
Winner declaration
Shows which model provided the better response
Factual grounding analysis
  • Response A (RAG): How well your model uses training data
  • Response B (Base): Evaluation of the unenhanced model
Key differences
Highlights why one response outperformed the other
Winner rationale
Detailed explanation of the judge’s decision
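These summary fields map naturally onto a structured record. The sketch below shows one way they might be represented and parsed if the judge replied in JSON; the field names mirror the report above, and the JSON contract is an assumption rather than the platform's actual format.

```python
# Illustrative structure for an evaluation summary (field names mirror the
# report above; the JSON contract is an assumption, not the platform's format).
import json
from dataclasses import dataclass

@dataclass
class EvaluationSummary:
    winner: str                 # "Response A (RAG)" or "Response B (Base)"
    grounding_a: str            # how well the trained model used its training data
    grounding_b: str            # assessment of the unenhanced base model
    key_differences: list[str]  # why one response outperformed the other
    rationale: str              # the judge's detailed explanation

def parse_summary(judge_json: str) -> EvaluationSummary:
    """Parse a JSON verdict from the judge into a summary record."""
    data = json.loads(judge_json)
    return EvaluationSummary(
        winner=data["winner"],
        grounding_a=data["grounding_a"],
        grounding_b=data["grounding_b"],
        key_differences=data["key_differences"],
        rationale=data["rationale"],
    )
```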

Evaluation criteria

The AI judge evaluates responses based on:
  • Factual accuracy from source material
  • Proper use of grounding and training data
  • Relevance to the question
  • Completeness and clarity
Grounded responses using your training data consistently outperform speculative answers.

Interpreting your results

Your model wins
Your customization is working effectively. Training data is being used properly, and the model is ready for this use case.
Base model wins
A knowledge gap has been identified. Add more training data on this topic and continue refinement.
Mixed results
Partial success indicates you should add data for questions where your model underperformed and continue testing.

Best practices

Testing strategy:
  • Ask 10-15 diverse questions minimum
  • Test scenarios where your data should provide an advantage
  • Include difficult and edge cases
  • Review “Past Results” to track improvement over time
After evaluation:
  1. Identify patterns in wins and losses (see the tally sketch after this list)
  2. Add training data to address knowledge gaps
  3. Re-test to verify improvements
  4. Iterate continuously
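Win/loss patterns are easier to spot when past results are tallied by topic. The sketch below assumes a hypothetical list of results with a topic label and a winner field; it is not an export format from the Evaluation Dashboard.

```python
# Hypothetical tally of judge verdicts by topic to surface knowledge gaps.
# The `results` shape is illustrative, not an export format from the dashboard.
from collections import Counter

results = [
    {"topic": "refund policy", "winner": "trained"},
    {"topic": "refund policy", "winner": "trained"},
    {"topic": "pricing tiers", "winner": "base"},
    {"topic": "pricing tiers", "winner": "base"},
    {"topic": "onboarding", "winner": "trained"},
]

losses_by_topic = Counter(r["topic"] for r in results if r["winner"] == "base")

# Topics where the base model keeps winning are candidates for more training data.
for topic, losses in losses_by_topic.most_common():
    print(f"{topic}: lost {losses} time(s) - consider adding training data here")
```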

Tips for success

  • Grounded responses always outperform speculation
  • Losses reveal where to add more training data
  • Test regularly as you add new content
  • Use realistic queries your actual users would ask