Why Use Datasets
- Fine-Tuning - Create training datasets from your best requests for custom model fine-tuning
- Model Evaluation - Build evaluation sets to test model performance and compare different versions
- Quality Control - Curate high-quality examples to improve prompt engineering and model outputs
- Data Analysis - Export structured data for external analysis and research
Creating Datasets
From the Requests Page
The easiest way to create datasets is by selecting requests from your logs:
1. Filter your requests - Use custom properties and filters to find the requests you want
2. Select requests - Check the boxes next to requests you want to include in your dataset
3. Add to dataset - Click “Add to Dataset” and choose to create a new dataset or add to an existing one
Via API
Create datasets programmatically for automated workflows:
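A minimal TypeScript sketch, assuming Helicone’s REST API at api.helicone.ai exposes a dataset-creation endpoint; the path, payload shape, and auth header below are illustrative, so confirm them against the Helicone API reference before relying on them:

```typescript
// Illustrative sketch: create a dataset from request IDs already in your logs.
// The endpoint path and body fields are assumptions, not confirmed API details.
const response = await fetch("https://api.helicone.ai/v1/helicone-dataset", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.HELICONE_API_KEY}`,
  },
  body: JSON.stringify({
    datasetName: "customer-support-v1",       // name shown in the dashboard
    requestIds: ["req_abc123", "req_def456"], // requests to include
  }),
});
const dataset = await response.json();
console.log(dataset);
```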
Building Quality Datasets
The Curation Process
Transform raw requests into high-quality training data through careful curation:
1. Collect broadly, then filter - Start by adding many potential examples, then narrow down to the best ones. It’s easier to prune weak examples than to hunt for missing ones later.
2. Review each example - Check every candidate for:
   - Accuracy - Is the response correct and helpful?
   - Consistency - Does it match the style and format you want?
   - Completeness - Does it fully address the user’s request?
3. Remove poor examples - Delete any examples that are:
   - Incorrect or misleading
   - Off-topic or irrelevant
   - Inconsistent with your desired behavior
   - Edge cases that might confuse the model
4. Balance your dataset - Ensure you have:
   - Examples covering all common use cases
   - Both simple and complex queries
   - A distribution that matches real usage
Quality beats quantity - 50-100 carefully curated examples often outperform thousands of uncurated ones. Focus on consistency and correctness over volume.
Dataset Dashboard
Access all your datasets at helicone.ai/datasets and manage your curated collections in one place:
- Track progress - Monitor dataset size and last updated time
- Access datasets - Click to view and curate contents
- Export data - Download datasets when ready for fine-tuning
- Maintain quality - Regularly review and improve your collections
Exporting Data
Export Formats
Download your datasets in various formats:
JSONL - Perfect for fine-tuning; the export is ready to use directly with OpenAI’s fine-tuning API.
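Each line of a JSONL file is a standalone JSON object; for OpenAI’s chat fine-tuning format that means a `messages` array ending with the ideal assistant reply. A small TypeScript sketch of writing such records (the example content and file name are made up):

```typescript
// Sketch: write curated examples as OpenAI chat fine-tuning JSONL,
// one JSON object with a `messages` array per line.
import fs from "fs";

const examples = [
  {
    messages: [
      { role: "system", content: "You are a concise support assistant." },
      { role: "user", content: "How do I reset my password?" },
      { role: "assistant", content: "Go to Settings > Security > Reset password." },
    ],
  },
];

fs.writeFileSync(
  "curated-dataset.jsonl",
  examples.map((e) => JSON.stringify(e)).join("\n") + "\n"
);
```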
API Export
Retrieve dataset contents programmatically:
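A minimal TypeScript sketch, again with an illustrative endpoint path and query body; check the Helicone API reference for the actual schema:

```typescript
// Illustrative sketch: page through a dataset's rows over the REST API.
// The path and query body are assumptions, not confirmed API details.
const datasetId = "your-dataset-id"; // from the dashboard URL or the create call
const response = await fetch(
  `https://api.helicone.ai/v1/helicone-dataset/${datasetId}/query`,
  {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.HELICONE_API_KEY}`,
    },
    body: JSON.stringify({ offset: 0, limit: 100 }),
  }
);
const rows = await response.json();
console.log(rows);
```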
Use Cases
Replace Expensive Models with Fine-Tuned Alternatives
The most common use case is using logs from your expensive models to train cheaper, faster ones:
1. Log high-quality outputs - Start logging successful requests from o3, Claude 4.1 Sonnet, Gemini 2.5 Pro, or other premium models that represent your ideal outputs
2. Build task-specific datasets - Create separate datasets for different tasks (e.g., “customer support”, “code generation”, “data extraction”)
3. Curate for consistency - Review examples to ensure responses follow the same format, style, and quality standards
4. Fine-tune smaller models - Export JSONL and fine-tune o3-mini, GPT-4o-mini, Gemini 2.5 Flash, or other models that are 10-50x cheaper (see the sketch after this list)
5. Iterate with production data - Continue collecting examples from your fine-tuned model to improve it over time
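For step 4, a minimal sketch using the official openai Node SDK; the file name is a placeholder and the model snapshot is just one fine-tunable option:

```typescript
// Sketch: upload the exported JSONL and start a fine-tuning job.
// "curated-dataset.jsonl" and the model snapshot are placeholders.
import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const file = await openai.files.create({
  file: fs.createReadStream("curated-dataset.jsonl"),
  purpose: "fine-tune",
});

const job = await openai.fineTuning.jobs.create({
  training_file: file.id,
  model: "gpt-4o-mini-2024-07-18",
});

console.log(`Fine-tuning job ${job.id} is ${job.status}`);
```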
Task-Specific Evaluation Sets
Build evaluation datasets to test model performance:
- Compare model versions before deploying (see the replay sketch after this list)
- Test prompt changes against consistent examples
- Identify model weaknesses and blind spots
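A rough replay sketch: run each example in the evaluation set against two model versions and compare the output to the curated reference answer. It assumes the export is OpenAI-style `messages` JSONL; the file name, model IDs, and the naive exact-match check are placeholders for whatever comparison you actually use:

```typescript
// Sketch: replay an evaluation dataset against two models and compare each
// output to the curated reference reply. Exact match is deliberately naive;
// swap in your own scoring.
import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI();
const records = fs
  .readFileSync("eval-set.jsonl", "utf8")
  .trim()
  .split("\n")
  .map((line) => JSON.parse(line));

for (const model of ["gpt-4o-mini", "ft:gpt-4o-mini-2024-07-18:acme::abc123"]) {
  let matches = 0;
  for (const { messages } of records) {
    const reference = messages.at(-1).content;       // curated answer
    const completion = await openai.chat.completions.create({
      model,
      messages: messages.slice(0, -1),               // drop the reference reply
    });
    const output = completion.choices[0].message.content ?? "";
    if (output.trim() === reference.trim()) matches++;
  }
  console.log(`${model}: ${matches}/${records.length} exact matches`);
}
```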
Continuous Improvement Pipeline

Use scores and user feedback to identify your best examples:
- Tag requests with custom properties for easy filtering (see the sketch after this list)
- Score outputs based on user feedback or automated metrics
- Auto-collect winners into datasets when they meet quality thresholds
- Retrain regularly with newly curated examples
- A/B test new models against production traffic
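A small sketch of the tagging step, using Helicone’s OpenAI proxy with Helicone-Property-* headers so successful requests are easy to filter into a dataset later; the property names and values are just examples:

```typescript
// Sketch: tag production requests with custom properties via the Helicone
// proxy. Property names/values below are examples; pick ones that match
// how you want to filter candidates into datasets.
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
    "Helicone-Property-Task": "customer-support",
    "Helicone-Property-Environment": "production",
  },
});

const completion = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "How do I reset my password?" }],
});
console.log(completion.choices[0].message.content);
```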
Start small - even 50-100 high-quality examples can significantly improve performance on specific tasks. Focus on one narrow use case first rather than trying to fine-tune a general-purpose model.
Best Practices
- Quality over Quantity - Choose fewer, high-quality examples rather than large datasets with mixed quality
- Diverse Examples - Include varied inputs, edge cases, and different user types in your datasets
- Regular Updates - Continuously add new examples as your application evolves and improves
- Clear Criteria - Document what makes a “good” example for each dataset’s specific purpose
Related Features
- Custom Properties - Tag requests to make dataset creation easier with filtering
- User Metrics - Track which users generate the best examples for your datasets
- Sessions - Include full conversation context in your datasets
- Feedback - Use user ratings to automatically identify dataset candidates
Datasets turn your production LLM logs into valuable training and evaluation resources. Start small with a focused use case, then expand as you see the benefits of curated, high-quality data.