Supported File Types
The 4Minds platform accepts the following file formats for dataset uploads:
- Text files (.txt) - Plain text documents
- Markdown files (.md) - Formatted text documents with markup
- CSV files (.csv) - Comma-separated value spreadsheets
- JSON files (.json) - Structured data in JSON format
- Parquet files (.parquet) - Columnar storage format (not supported for Hugging Face imports)
- PDF files (.pdf) - Portable document format files
- Word documents (.docx) - Microsoft Word documents
- Excel spreadsheets (.xlsx) - Microsoft Excel workbooks
- JPEG images (.jpg, .jpeg) - Compressed image files
- PNG images (.png) - Portable network graphics
- GIF images (.gif) - Graphics interchange format
- BMP images (.bmp) - Bitmap image files
- TIFF images (.tiff) - Tagged image file format
- ZIP archives - Compressed folders containing multiple files
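If you script your uploads, a quick client-side check against this list can catch unsupported files before you send them. A minimal sketch in Python (the extension set simply mirrors the list above; adjust it to match your plan):

```python
from pathlib import Path

# Extensions accepted for dataset uploads (mirrors the list above)
SUPPORTED_EXTENSIONS = {
    ".txt", ".md", ".csv", ".json", ".parquet", ".pdf", ".docx", ".xlsx",
    ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".zip",
}

def supported_files(folder):
    """Return the files in a folder whose extensions the platform accepts."""
    return [
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    ]

print(supported_files("./my_dataset"))
```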
Automatic OCR Processing
The 4Minds platform automatically extracts text from images and scanned documents using built-in Optical Character Recognition (OCR). This feature works seamlessly across all base models; no configuration is required.
OCR is applied to:
- PDF files with scanned or non-selectable text
- Image files (JPG, PNG, TIFF, BMP, GIF) containing text
- Documents with embedded images
How it works: When you upload files, our Reflex Router™ automatically detects content that requires OCR processing and extracts the text. The extracted content is then made available for model training and inference, just like any other text data.
Key benefits:
- Works with any base model you select for inline tuning
- No manual configuration needed
- Seamlessly integrated into the data processing pipeline
OCR accuracy depends on image quality and resolution. For best results, use clear, high-resolution scans.
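If you want to know in advance which of your PDFs are likely to rely on OCR, you can check locally whether a PDF already has a selectable text layer. A rough sketch using the pypdf library (a local heuristic only, not part of the platform):

```python
from pypdf import PdfReader  # pip install pypdf

def has_selectable_text(path):
    """Rough check: does this PDF already contain an extractable text layer?"""
    reader = PdfReader(path)
    extracted = "".join((page.extract_text() or "") for page in reader.pages)
    return len(extracted.strip()) > 0

# PDFs that return False here are likely scans and will rely on OCR
print(has_selectable_text("scanned_invoice.pdf"))
```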
Upload Size Limit
You can upload up to 100 MB of data at a time. This applies to single files, multiple files, or integration datasets. A progress bar will display the total upload size.
To upload more data, simply reopen the dataset and upload the next 100 MB batch.
There is no limit on the overall dataset size, only on each individual upload batch.
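If you're scripting uploads for a large folder, you can group files into batches that stay under the 100 MB limit. A minimal sketch (how you send each batch depends on the upload method you use):

```python
from pathlib import Path

MAX_BATCH_BYTES = 100 * 1024 * 1024  # 100 MB per upload batch

def batch_files(folder):
    """Group files into batches whose combined size stays under the limit."""
    batches, current, current_size = [], [], 0
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        # A single file larger than the limit still ends up in its own batch
        size = path.stat().st_size
        if current and current_size + size > MAX_BATCH_BYTES:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches

for i, batch in enumerate(batch_files("./my_dataset"), start=1):
    print(f"Batch {i}: {len(batch)} files")
```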
Adding Data to Existing Datasets
As your business evolves, your model’s knowledge needs to evolve with it. Adding new data to existing datasets keeps your AI current and effective without starting from scratch.
Why Continuous Data Updates Matter
Maintain accuracy - Product features change, policies update, and new edge cases emerge. Without fresh data, your model provides outdated information that frustrates users and erodes trust.
Capture new patterns - Each customer interaction reveals new ways people describe problems, ask questions, or use your product. Adding these examples helps your model understand diverse communication styles.
Improve coverage - Initial training datasets rarely cover every scenario. As you discover gaps in your model’s knowledge, you can fill them by adding targeted data.
Adapt to business changes - New products, services, pricing models, or support processes require corresponding updates to your training data.
How to Add New Data
From the Datasets tab
- Navigate to the Datasets tab
- Locate the dataset you want to update
- Click on the dataset to open it
- Click Upload Additional Data
- Choose your data source: Upload Files, Integrations, or URL.
- Upload your new data based on the selected source.
- The platform automatically processes and integrates the new data into your knowledge graph.
From the Model tab
- Navigate to the Models tab
- Select the model you want to update with additional data
- Click the three-dot menu (⋮) in the Actions column for that model
- Click the Add Training Data button in the shortcuts section
- Click Upload Additional Files
- Choose your data source: Upload Files, Integrations, or URL.
- Upload your new data based on the selected source.
- The data is processed and added to your model’s knowledge graph automatically
From the Control Center
- Open your model in the Control Center
- Click the Add Training Data button in the shortcuts section
- Click Upload Additional Files
- Choose your data source: Upload Files, Integrations, or URL.
- Upload your new data based on the selected source.
- The data is processed and added to your model’s knowledge graph automatically
From the Playground tab
- Navigate to the Playground tab
- Select the model you want to interact with
- While querying or testing your model, click the + icon next to the message box
- Choose your data source: Attach File or Add URLs.
- Upload your new data based on the selected source.
- The data is automatically processed and integrated into your model’s knowledge graph
Adding data directly from the Playground is useful when you discover knowledge gaps during testing. You can immediately upload relevant information without leaving your testing workflow.
New data is automatically integrated into your existing knowledge graph. Nodes and edges update to reflect the new information without disrupting existing knowledge structures.
Best Practices for Data Updates
Add incrementally - Rather than waiting to upload large batches, add new data regularly as it becomes available. This keeps your model current and makes it easier to track what information was added when.
Document your updates - Keep notes on what data you added and why. This helps you understand model behavior changes and plan future updates.
Test after updates - Use the Inference Model feature to verify that new data is being used correctly and hasn’t introduced conflicts with existing knowledge.
Mix data types - When adding new information, include multiple formats when possible. For example, if you’re adding a new product feature, include documentation (PDF), example support tickets (CSV), and screenshots (images).
Retrain when needed - After significant data additions, retrain your model to fully integrate the new knowledge. Minor updates may not require retraining, but substantial changes benefit from it.
Building Comprehensive Datasets
Training an effective AI model requires more than uploading a single file type. Just as you wouldn’t hire a customer support agent and only give them a product manual, your model needs diverse perspectives and contexts to develop true understanding.
Example: Training on Customer Support Excellence
Let’s say you want your model to handle customer support inquiries effectively. Here’s how to structure a robust, multimodal dataset using 4Minds’ supported formats:
Visual understanding (images & screenshots)
Upload visual content showing real customer interactions:
- Product interfaces - Screenshots of your software, dashboard views, error messages, feature locations
- Troubleshooting visuals - Common configuration issues, installation steps, system architecture diagrams
- Documentation - Annotated screenshots showing workflows, setup guides, integration diagrams
- Error states - What customers see when things go wrong, loading states, failure modes
- Customer-submitted images - Photos of hardware issues, setup problems, packaging damage
- Competitor products - Interface comparisons, feature differences, migration guides
Conceptual knowledge (PDFs & documents)
Add comprehensive written content:
- Product documentation - Technical specifications, API references, user guides, release notes
- Internal knowledge bases - Troubleshooting playbooks, known issues, workaround procedures
- Policy documents - SLA agreements, refund policies, terms of service, data privacy guidelines
- Training materials - Onboarding docs for new support agents, escalation procedures, quality standards
- Industry context - Regulatory compliance guides, security best practices, industry standards
- Best practices - Customer service frameworks, communication guidelines, de-escalation techniques
- Competitive intelligence - How competitors solve similar problems, market positioning, feature comparisons
Structured data (CSV & spreadsheet files)
Include quantitative patterns and history:
- Support ticket history - Ticket IDs, timestamps, issue categories, resolution times, customer satisfaction scores
- Customer data - Account types, subscription tiers, usage patterns, feature adoption rates
- Product usage analytics - Most-used features, error rates, session durations, drop-off points
- Response metrics - First response time, resolution time, reopened tickets, escalation rates
- Customer sentiment - NPS scores, CSAT ratings, survey responses, sentiment analysis results
- Seasonal patterns - Ticket volume by time/day/season, spike events, capacity planning data
- Agent performance - Resolution rates, customer satisfaction per agent, specialization areas
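To make the structured-data idea concrete, here is a hypothetical sketch of a support-ticket history file written as CSV. Column names and values are purely illustrative; use whatever fields your help desk actually exports:

```python
import csv

# Illustrative support-ticket history rows (hypothetical values)
rows = [
    {"ticket_id": "T-1042", "created_at": "2024-03-02T09:14:00Z",
     "category": "billing", "resolution_minutes": 38, "csat": 5},
    {"ticket_id": "T-1043", "created_at": "2024-03-02T10:02:00Z",
     "category": "login", "resolution_minutes": 112, "csat": 3},
]

with open("support_ticket_history.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```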
Communication history (email & chat logs)
Provide real conversation examples:
- Resolved tickets - Successful interactions showing problem identification and resolution
- Escalated cases - Complex issues requiring multiple touchpoints or specialist involvement
- Edge cases - Unusual requests, policy exceptions, creative problem-solving examples
- Tone variations - Professional responses, empathetic communications, frustrated customer de-escalation
- Multi-channel interactions - Email threads, chat transcripts, phone call summaries, social media responses
- Follow-ups - Post-resolution check-ins, proactive outreach, account management communications
Audio and video (coming soon)
Add dynamic training materials:
- Call recordings - Customer support calls showing tone, pacing, active listening, problem resolution
- Product demos - Video walkthroughs of features, setup processes, advanced use cases
- Training sessions - Internal workshops, role-playing scenarios, best practice reviews
- Customer feedback sessions - User interviews, usability testing, feature request discussions
Contextual business data (mixed formats)
Round out understanding with operational context:
- Product roadmap - Upcoming features, deprecation schedules, beta programs
- Billing systems - Invoice examples, pricing tiers, renewal processes, refund workflows
- Integration documentation - Third-party connections, API partnerships, data sync processes
- Company information - Team structure, hours of operation, regional support coverage, contact escalation paths
- Legal & compliance - GDPR requirements, data handling procedures, audit trails, security protocols
Why This Matters
When you combine these diverse data types, your model develops:
- Contextual problem-solving that understands not just what the issue is, but why it matters and how it impacts the customer’s business
- Tone awareness from seeing thousands of interactions, knowing when to be technical vs empathetic, formal vs conversational
- Pattern recognition identifying common issues before customers fully describe them, predicting follow-up questions
- Operational intelligence understanding SLAs, escalation paths, when to involve specialists, and business constraints
- Proactive guidance suggesting solutions based on similar past cases, usage patterns, and product knowledge
A model trained only on product documentation would fail when a frustrated customer describes a problem in non-technical terms, or when an edge case requires policy interpretation. But a model trained with this comprehensive, multimodal approach develops the nuanced intelligence to handle real customer interactions effectively.
Tutorial: Fine-Tune a Model with Hugging Face Datasets
This tutorial walks you through importing datasets from Hugging Face to train a custom model in 4Minds.
What you’ll build
By the end of this tutorial, you’ll have a custom model trained on Hugging Face data that can:
- Understand domain-specific terminology and concepts
- Extract relevant information from your training data
- Provide accurate, contextual responses to queries in your domain
Prerequisites
- A 4Minds account with access to model creation
- A Hugging Face account with credentials configured, if the integration shows "Not configured" (see Step 2)
Fine-tuning overview
Fine-tuning allows you to customize base models for your specific use case by training them on your own data. The fine-tuning feature enables you to:
- Create custom models tailored to your domain (e.g., financial analysis, customer support)
- Train on proprietary datasets to improve accuracy for specific tasks
- Deploy models via API or test them in the interactive Playground
- Monitor performance metrics including response time, token speed, and success rate
Model status types
| Status | Description |
|---|---|
| Ready | Model is trained and available for use |
| Building Graph | Model is currently being compiled (shows percentage progress) |
| Training | Model is actively learning from training data |
| Archived | Model is stored but not actively deployed |
Selecting a base model
| Base Model | Parameters | Best For |
|---|---|---|
| Phi | 14b | Lightweight tasks, faster inference |
| Gemma | 27b | Balanced performance and capability |
| Nemotron | 70b | Complex reasoning, highest accuracy |
Training data best practices
- Provide diverse examples – Include variations of similar questions to improve generalization
- Maintain consistency – Use a consistent format and tone across all training samples
- Include edge cases – Add examples of boundary conditions and unusual queries
- Quality over quantity – 500 high-quality examples often outperform 5,000 poor ones
Step 1: Access the data upload screen
During the model creation process (Step 3 of 4), you’ll reach the Data Upload screen. Here you can choose how to provide training data to customize your model.
- Under Choose Data Source, ensure the Upload New Data tab is selected
- You have three options under Add Files from Sources:
- Upload Files - Local files from your computer
- Integrations - External data sources
- URL - Import from a web address
- To import from Hugging Face, click the Integrations button
Step 2: Select Hugging Face integration
On the Select Integration screen, you’ll see a list of available data source integrations including Amazon S3, Azure Blob Storage, Google Cloud Storage, and others.
- Scroll down and select Hugging Face from the list
If you see “Not configured” next to an integration, you may need to set up credentials first via Configure Integrations at the top of the list.
Step 3: Search for your dataset
The Import from HuggingFace screen allows you to search the Hugging Face Hub for datasets.
- Enter your search query in the search bar (e.g., “finQA”)
- Click Search
- Browse the results using the available tabs:
- Popular Datasets – Trending datasets on Hugging Face
- My Datasets – Your personal Hugging Face datasets
- Search Results – Results matching your query
Each dataset card displays helpful information including:
- Dataset name and author
- Description
- Download count
- Size and format
- Task type and modality
Click on the dataset you want to import.
Step 4: Configure dataset details
On the Dataset Details screen, you can configure import settings for your selected dataset.
Review the dataset information:
- Name and author
- Description
- Download statistics
- Task IDs, size, and format
Configure the following options:
- Configuration – Select the dataset configuration (e.g., “Default”)
- Split – Choose which data split to import (e.g., “Test”, “Train”, “Validation”)
When ready, click + Add Dataset to import the files.
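If you're unsure what Configuration and Split mean here, they correspond to the arguments of the Hugging Face datasets library: the configuration is the optional dataset config name, and the split is the split argument. A quick local sketch for inspecting a split before importing it (the dataset name is the FinQA dataset used in the API tutorial later in this document):

```python
from datasets import load_dataset

# "Split" maps to the split argument; a named configuration, if the dataset
# defines one, would be passed as the second positional argument.
ds = load_dataset("ibm/finqa", split="train")
print(ds)           # row count and column names for the chosen split
print(ds.features)  # schema of that split
```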
Step 5: Review attached files
After importing, you’ll return to the Data Upload screen. Your imported files now appear under Attached Files with details including:
- File name
- Source (Hugging Face icon)
- File size
- Row count
For example, importing a dataset might result in files like:
- relevance.jsonl – 66.68 KB, 341 rows
- queries.jsonl – 137.71 KB, 705 rows
- corpus.jsonl – 1.44 MB, 7549 rows
Rsync settings (optional)
Enable Rsync Settings to automatically sync new files from your Hugging Face sources when you log in. This keeps your training data up to date.
Click Next to proceed.
Step 6: Review and launch training
On the Review & Launch screen (Step 4 of 4), verify your configuration summary:
| Setting | Value |
|---|---|
| Use Case | Your selected use case |
| Base Model | e.g., Phi-4-14B AWQ |
| Data Files | Imported from Hugging Face |
| Rsync Configuration | HuggingFace - All folders |
| Persona | Your selection or default |
| Deployment | e.g., Cloud API |
If everything looks correct, click Confirm & Train to start the training process.
Step 7: Monitor training progress
After launching, you’ll be taken to the Models dashboard in Control Center. Your new model will appear in the list with:
- Status – “New” badge with “Building Graph” progress indicator
- Parameters – Model size (e.g., 14b)
- Base – Base model used (e.g., Phi)
- Created – Timestamp
The status will update as training progresses through the pipeline. Once complete, the status will change to Ready.
Step 8: Test in the Playground
The Playground provides an interactive environment to evaluate your fine-tuned model before deployment.
Accessing the Playground:
- From the model dashboard, click the ⋮ menu on any model
- Select Run Model
- Or navigate to Control Center → Playground and select your model
Playground features:
- Real-time responses – See model outputs as they generate
- Conversation history – Maintain context across multiple turns
- View Graph – Visualize model reasoning and token flow
- Clear All Chats – Reset the conversation history
- Add Model – Compare multiple models side-by-side
Example test queries for a financial analysis model:
- “What is the ratio of operating income to total revenue?”
- “What is the total of all lease obligations?”
- “What was the percentage change in revenue from 2018 to 2019?”
Model actions
Access these options via the ⋮ menu on any model:
| Action | Description |
|---|---|
| Run Model | Open the model in the Playground for testing |
| API Access | View API endpoints and authentication details |
| Edit Model | Modify model configuration and settings |
| Add Training Data | Upload additional training examples |
| Full Screen | Expand the model view |
| Duplicate | Create a copy of the model with its settings |
| Archive | Move to archived storage (can be restored) |
| Delete | Permanently remove the model |
API integration
Deploy your fine-tuned model via API for production use.
Getting API credentials:
- Click ⋮ on your model
- Select API Access
- Copy your API endpoint and authentication token
Example request:
curl -X POST https://api.4minds.ai/v1/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "customer-faq-expert",
    "prompt": "How do I reset my password?",
    "max_tokens": 500
  }'
Improving response quality:
- Add more training data – Expand coverage of your use case
- Refine existing data – Remove low-quality or contradictory examples
- Adjust the persona – Use “Technical Expert” for specialized domains
- Choose appropriate model size – Larger models (70b) handle complex reasoning better
Improving speed:
- Use smaller base models – Phi (14b) offers faster inference
- Optimize prompt length – Shorter prompts reduce processing time
- Enable caching – Reuse responses for common queries
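Enabling caching on the client side can be as simple as memoizing responses so that repeated prompts don't trigger new API calls. A minimal sketch that reuses the completions request shown above (treat the endpoint, field names, and cache policy as assumptions to adapt):

```python
import os
import requests

API_KEY = os.environ.get("FOURMINDS_API_KEY", "YOUR_API_KEY")
_cache = {}  # prompt -> response JSON

def query_with_cache(prompt):
    """Return a cached response for repeated prompts instead of re-calling the API."""
    if prompt not in _cache:
        response = requests.post(
            "https://api.4minds.ai/v1/completions",
            headers={"Authorization": f"Bearer {API_KEY}",
                     "Content-Type": "application/json"},
            json={"model": "customer-faq-expert", "prompt": prompt, "max_tokens": 500},
        )
        _cache[prompt] = response.json()
    return _cache[prompt]

print(query_with_cache("How do I reset my password?"))
print(query_with_cache("How do I reset my password?"))  # served from the local cache
```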
Troubleshooting
| Issue | Solution |
|---|---|
| Model stuck on “Building Graph” | Large models may take longer; check progress percentage |
| Low success rate | Review training data for errors or inconsistencies |
| Slow response time | Consider a smaller base model or optimize prompts |
| Inaccurate responses | Add more diverse training examples |
FAQs
Q: How long does training take?
Training time depends on model size and dataset. Expect 30 minutes to several hours for large models.
Q: Can I update a model after deployment?
Yes, use “Add Training Data” to incrementally improve your model.
Q: What’s the difference between Archive and Delete?
Archived models can be restored; deleted models are permanently removed.
Q: How many models can I have active?
Check your plan limits in the account settings.
Supported Formats for Hugging Face Imports
4Minds supports the following file formats for training data from Hugging Face:
BMP, CSV, DOCX, GIF, HTML, JPEG, JPG, JSON, JSONL, MD, ODT, PARQUET, PDF, PNG, TIFF, TSV, TXT, XLSX
Multiple files are supported per upload.
Tips
- Choose appropriate splits – For fine-tuning, you typically want the “Train” split. Use “Test” or “Validation” for evaluation datasets.
- Check dataset size – Larger datasets may take longer to import and process.
- Enable Rsync – If you’re working with frequently updated datasets, enable Rsync to stay current automatically.
For best results, combine the Hugging Face dataset with your organization’s proprietary documents. This creates a model that understands both general concepts and your specific business context.
Tutorial: Fine-Tune a Model with Hugging Face Datasets via API
This tutorial shows how to fine-tune a 4Minds model using the FinQA dataset from Hugging Face through the API. Since the API requires manual dataset uploads, you’ll download the dataset from Hugging Face and upload it to 4Minds.
What you’ll build
A custom model trained on financial Q&A data, created entirely through API calls, ideal for automation and CI/CD pipelines.
Prerequisites
- A 4Minds account with API access
- Your API key (found in Account Settings)
- A Hugging Face account with a generated access token
- Python 3.7+ with the requests and datasets libraries installed
Step 1: Download the FinQA dataset from Hugging Face
First, download the FinQA dataset locally using the Hugging Face datasets library:
from datasets import load_dataset
import json
# Load the FinQA dataset
dataset = load_dataset("ibm/finqa", split="train")
# Convert to JSON format for upload
data = [{"question": item["question"], "answer": item["answer"]} for item in dataset]
# Save to a local file
with open("finqa_training_data.json", "w") as f:
    json.dump(data, f, indent=2)
print(f"Saved {len(data)} training examples to finqa_training_data.json")
Step 2: Upload the dataset to 4Minds
Use the 4Minds API to create a dataset and upload your file:
import requests
API_KEY = "your_api_key_here"
BASE_URL = "https://api.4minds.ai/api/v1"
headers = {
    "Authorization": f"Bearer {API_KEY}",
}
# Create a new dataset with the uploaded file
with open("finqa_training_data.json", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/user/dataset",
        headers=headers,
        files={"file": ("finqa_training_data.json", f, "application/json")},
        data={"name": "FinQA Training Data"}
    )
dataset_response = response.json()
dataset_id = dataset_response["id"]
print(f"Created dataset with ID: {dataset_id}")
Step 3: Create a model with the dataset attached
Now create a new model and attach your dataset for training:
# Create a new model with the dataset
model_payload = {
    "name": "Financial QA Assistant",
    "description": "Fine-tuned on FinQA dataset for financial question answering",
    "dataset_id": dataset_id
}
response = requests.post(
    f"{BASE_URL}/user/model",
    headers={**headers, "Content-Type": "application/json"},
    json=model_payload
)
model_response = response.json()
model_id = model_response["id"]
print(f"Created model with ID: {model_id}")
print(f"Training status: {model_response['status']}")
Step 4: Monitor training progress
Poll the API to check when training completes:
import time
while True:
    response = requests.get(
        f"{BASE_URL}/user/model/{model_id}",
        headers=headers
    )
    status = response.json()["status"]
    print(f"Training status: {status}")
    if status == "ready":
        print("Training complete!")
        break
    elif status == "failed":
        print("Training failed. Check the dashboard for details.")
        break
    time.sleep(30)  # Check every 30 seconds
Step 5: Test your model via API
Once training completes, send inference requests to your fine-tuned model:
# Send a test query
inference_payload = {
    "model_id": model_id,
    "message": "What was the revenue growth percentage year-over-year?"
}
response = requests.post(
    f"{BASE_URL}/user/model/{model_id}/inference",
    headers={**headers, "Content-Type": "application/json"},
    json=inference_payload
)
print("Model response:")
print(response.json()["response"])
Complete script
Here’s the full workflow in a single script:
from datasets import load_dataset
import requests
import json
import time
# Configuration
API_KEY = "your_api_key_here"
BASE_URL = "https://api.4minds.ai/api/v1"
headers = {"Authorization": f"Bearer {API_KEY}"}
# Step 1: Download FinQA from Hugging Face
print("Downloading FinQA dataset...")
dataset = load_dataset("ibm/finqa", split="train")
data = [{"question": item["question"], "answer": item["answer"]} for item in dataset]
with open("finqa_training_data.json", "w") as f:
    json.dump(data, f, indent=2)
# Step 2: Upload to 4Minds
print("Uploading dataset to 4Minds...")
with open("finqa_training_data.json", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/user/dataset",
        headers=headers,
        files={"file": ("finqa_training_data.json", f, "application/json")},
        data={"name": "FinQA Training Data"}
    )
dataset_id = response.json()["id"]
# Step 3: Create model
print("Creating model...")
response = requests.post(
    f"{BASE_URL}/user/model",
    headers={**headers, "Content-Type": "application/json"},
    json={
        "name": "Financial QA Assistant",
        "description": "Fine-tuned on FinQA for financial Q&A",
        "dataset_id": dataset_id
    }
)
model_id = response.json()["id"]
# Step 4: Wait for training
print("Waiting for training to complete...")
while True:
    response = requests.get(f"{BASE_URL}/user/model/{model_id}", headers=headers)
    status = response.json()["status"]
    if status == "ready":
        break
    time.sleep(30)
# Step 5: Test the model
print("Testing model...")
response = requests.post(
    f"{BASE_URL}/user/model/{model_id}/inference",
    headers={**headers, "Content-Type": "application/json"},
    json={"message": "What was the revenue growth percentage?"}
)
print(f"Response: {response.json()['response']}")
Store your API key in environment variables rather than hardcoding it. Use os.environ.get("FOURMINDS_API_KEY") for production scripts.
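For example, at the top of your script:

```python
import os

# Read the API key from the environment instead of hardcoding it
API_KEY = os.environ.get("FOURMINDS_API_KEY")
if not API_KEY:
    raise RuntimeError("Set the FOURMINDS_API_KEY environment variable before running")
```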