
- RAGAS Benchmark (Currently Available): A standardized test that measures model performance on specific tasks using predefined metrics and datasets.
- Human Evaluation (Coming Soon): Manual assessment by human reviewers to evaluate response quality, relevance, and appropriateness based on subjective criteria.
- Model as Judge (Currently Available): Automatically compares your customized model against a base foundation model, with ChatGPT acting as an AI judge to evaluate responses side-by-side and determine which performs better.
RAGAS Benchmark evaluation
Step 1: Choose ‘RAGAS Benchmark’ evaluation method
- Click “Create New Evaluation”
- Select RAGAS Benchmark
- Click Next

Step 2: Select the base model
- Choose the base model you want to evaluate
- Click Next
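The platform runs the RAGAS Benchmark for you, but if you want to see the kind of metrics a RAGAS-style evaluation computes, the open-source ragas Python library is a useful reference point. The sketch below is illustrative only: it assumes the ragas 0.1.x API (which uses an OpenAI key by default), the dataset rows are invented, and the platform's own pipeline may differ.

```python
# Minimal sketch of a RAGAS-style evaluation using the open-source `ragas`
# library (0.1.x API). Dataset rows are invented examples, not platform data.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

rows = {
    "question": ["What is the enterprise refund window?"],
    "answer": ["Enterprise refunds are prorated within 30 days."],
    "contexts": [["Refunds for enterprise plans are prorated within 30 days."]],
    "ground_truth": ["Enterprise refunds are prorated within 30 days of purchase."],
}
dataset = Dataset.from_dict(rows)

# Scores each row on faithfulness, answer relevancy, and context precision.
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
```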



Model as Judge evaluation
Model as Judge automatically compares your customized model against a base foundation model. ChatGPT evaluates responses side-by-side to determine which performs better.
Setting up a Model as Judge evaluation
Step 1: Choose ‘Model as Judge’ evaluation method
- Click “Create New Evaluation”
- Select Model as Judge
- Click Next

Step 2: Select your customized model
- Choose the customized model you want to evaluate
- Review the model description
- Click Next

Step 3: Review the evaluation details
- Evaluation name: Auto-generated name
- Your trained model: Your customized model (with RAG)
- Base model: Foundation model for comparison (without RAG)
- Evaluation type: Model as Judge
Using the evaluation interface

- Left panel: Your trained model (with your data)
- Right panel: Base model (without your data)
- Type your question in the input box
- Send to both models simultaneously
- Review responses in real-time
- Scroll down to view automated evaluation results (a code sketch of the side-by-side pattern follows)
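The interface handles this comparison for you; the sketch below only illustrates the underlying side-by-side pattern. It assumes the OpenAI Python SDK, and the model names, retrieved context, and judge prompt are placeholders rather than the platform's actual internals.

```python
# Illustrative sketch of the side-by-side comparison the interface automates.
# Model names, context, and judge prompt are assumptions, not platform internals.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(model: str, question: str, context: str = "") -> str:
    """Send one question to one model, optionally grounded in retrieved context."""
    messages = [{"role": "user", "content": f"{context}\n\n{question}".strip()}]
    reply = client.chat.completions.create(model=model, messages=messages)
    return reply.choices[0].message.content

question = "What is our refund policy for enterprise customers?"
retrieved = "Refunds for enterprise plans are prorated within 30 days."  # from your RAG store

response_a = ask("gpt-4o", question, context=retrieved)  # stands in for your trained model (with RAG)
response_b = ask("gpt-4o", question)                     # stands in for the base model (no RAG)

judge_prompt = (
    "You are an impartial judge. Compare the two responses to the question "
    "and state which is better and why.\n\n"
    f"Question: {question}\n\nResponse A:\n{response_a}\n\nResponse B:\n{response_b}"
)
print(ask("gpt-4o", judge_prompt))
```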
Understanding evaluation results
Each evaluation summary includes:
- Winner declaration: Shows which model provided the better response.
- Factual grounding analysis: Highlights why one response outperformed the other.
  - Response A (RAG): How well your model uses training data
  - Response B (Base): Evaluation of the unenhanced model
- Winner rationale: Detailed explanation of the judge’s decision.
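For note-taking or downstream analysis, the summary fields above map naturally onto a simple record type. The Python sketch below is a hypothetical shape; the field names are assumptions, not the platform's actual schema.

```python
from dataclasses import dataclass

# Hypothetical shape for one evaluation summary; names are illustrative only.
@dataclass
class EvaluationSummary:
    winner: str                 # e.g. "trained_model" or "base_model"
    grounding_response_a: str   # how well the RAG model used its training data
    grounding_response_b: str   # assessment of the unenhanced base model
    winner_rationale: str       # the judge's detailed explanation
```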
Evaluation criteria
The AI judge evaluates responses based on:
- Factual accuracy from source material
- Proper use of grounding and training data
- Relevance to the question
- Completeness and clarity
Grounded responses using your training data consistently outperform speculative answers.
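The platform's actual judge prompt is not exposed, but the four criteria above can be pictured as a rubric like the one sketched below. The wording is an assumption for illustration only.

```python
# Hypothetical rubric expressing the four criteria as a judge prompt.
# The platform's real prompt may be worded and weighted differently.
JUDGE_RUBRIC = """Compare Response A and Response B to the question below.
Assess each response on:
1. Factual accuracy against the provided source material
2. Proper use of grounding and training data
3. Relevance to the question
4. Completeness and clarity
Declare which response wins and explain the rationale for your decision."""
```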
Interpreting your results
- Your model wins: Your customization is working effectively. Training data is being used properly, and the model is ready for this use case.
- Base model wins: A knowledge gap has been identified. Add more training data on this topic and continue refinement.
- Mixed results: Partial success indicates you should add data for questions where your model underperformed and continue testing.
Best practices
Testing strategy (a scripted version of this loop is sketched after the list):
- Ask 10-15 diverse questions minimum
- Test scenarios where your data should provide an advantage
- Include difficult and edge cases
- Review “Past Results” to track improvement over time
- Identify patterns in wins and losses
- Add training data to address knowledge gaps
- Re-test to verify improvements
- Iterate continuously
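For teams who prefer to run the loop above as a script rather than question by question in the UI, a minimal sketch follows. The compare_models helper is hypothetical and must be wired up to your own side-by-side comparison (for example, the earlier judge sketch); the sample questions are placeholders.

```python
# Sketch of a batch testing loop: ask a set of realistic questions, tally which
# side wins, and collect the losses as candidates for new training data.
from collections import Counter

def compare_models(question: str) -> str:
    """Hypothetical helper: run a side-by-side judged comparison for one
    question and return 'trained', 'base', or 'tie'. Wire this up to your
    own comparison code (e.g. the earlier judge sketch)."""
    raise NotImplementedError

test_questions = [
    "What is the refund window for enterprise customers?",
    "How do I escalate a priority-1 support ticket?",
    # ... 10-15 diverse questions, including edge cases
]

results = Counter()
knowledge_gaps = []
for question in test_questions:
    verdict = compare_models(question)
    results[verdict] += 1
    if verdict == "base":
        knowledge_gaps.append(question)  # add training data for these topics

print(results)                      # e.g. Counter({'trained': 11, 'base': 3, 'tie': 1})
print("Add data for:", knowledge_gaps)
```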
Tips for success
- Grounded responses consistently outperform speculation
- Losses reveal where to add more training data
- Test regularly as you add new content
- Use realistic queries your actual users would ask

