Core Concepts in Datafast¶
Guiding Design Principle¶
Less Is More: We prioritize data quality, data diversity, and reliability over quantity. We don't measure success by shipping more features: we succeed if datafast works when you try it out.
This is also why we don't support a long list of LLM providers: we focus on the ones we know will work well for you.
LLM Provider Abstraction¶
Datafast implements an LLM provider abstraction layer that decouples dataset generation logic from specific LLM implementations. This architecture enables:
- Model Interchangeability: Swap between different LLM providers without changing your dataset generation code
- Multi-Model Generation: Use multiple LLMs in parallel to increase output diversity and quality
- Unified Interface: Interact with all supported LLMs through a consistent API regardless of their underlying implementation differences
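In practice, this means swapping providers is a one-line change. A minimal sketch (the datafast.llms import path and the model_id argument are assumptions based on common usage; check the API reference for exact names):

from datafast.llms import OpenAIProvider, AnthropicProvider  # assumed import path

# Both classes expose the same interface, so nothing else in your
# generation code changes when you swap one for the other.
llm = OpenAIProvider(model_id="gpt-4o-mini")          # hypothetical model name
# llm = AnthropicProvider(model_id="claude-3-haiku")  # drop-in replacement

# Passing several providers spreads generation across models,
# which increases output diversity.
providers = [OpenAIProvider(), AnthropicProvider()]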
Dataset Configuration¶
Datafast uses simple configuration objects to define your dataset requirements. Instead of writing complex code to set up your dataset, you just fill in a configuration with the details you need:
# Example of a simple configuration for a classification dataset
from datafast.schema.config import ClassificationDatasetConfig  # import path may vary by version

config = ClassificationDatasetConfig(
    # Define which classes/categories you want
    classes=[
        {"name": "positive", "description": "Text expressing positive emotions"},
        {"name": "negative", "description": "Text expressing negative emotions"},
    ],
    # How many examples to generate per prompt
    num_samples_per_prompt=5,
    # Where to save the result
    output_file="sentiment_dataset.jsonl",
    # Which languages to generate
    languages={"en": "English", "fr": "French"},
)
This approach makes it easy to create, modify, and share dataset specifications without changing any underlying code.
Architectural Components¶
LLM Providers¶
The provider layer acts as an abstraction over different LLM APIs and services:
- Integration Mechanism: Utilizes LiteLLM for unified communication with various LLM services
- Provider Classes: Each supported provider (OpenAI, Anthropic, Gemini, Ollama) has a dedicated class handling authentication, request formatting, and response parsing
- Structured Output: Supports generation of structured data through Pydantic models for consistent and predictable output formats
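The gist of structured output, sketched with plain Pydantic (the row schema below is hypothetical, not a datafast class):

from pydantic import BaseModel

class TextClassificationRow(BaseModel):
    # Hypothetical schema for one generated example.
    text: str
    label: str

# What a raw LLM response looks like once the provider has constrained
# it to match the schema above.
raw_response = '{"text": "I love this product!", "label": "positive"}'

# Pydantic parses the string into a validated, typed object; a malformed
# response raises a ValidationError instead of silently corrupting data.
row = TextClassificationRow.model_validate_json(raw_response)
print(row.label)  # -> positive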
Dataset Classes¶
Each dataset type is implemented as a specialized class that:
- Processes Configuration: Validates and prepares generation parameters
- Manages Prompt Creation: Constructs effective prompts based on the dataset type and configuration
- Orchestrates Generation: Orchestrates the LLM calls and data collection process
- Output Formatting: Formats and saves the generated data in standardized formats
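You rarely touch these internals directly: you construct the class with a config and call its generation entry point. A minimal sketch reusing the config and providers from the examples above (the datafast.datasets import path is an assumption):

from datafast.datasets import ClassificationDataset  # assumed import path

# The dataset class validates the config, builds and expands prompts,
# calls the providers, and formats and saves the output rows.
dataset = ClassificationDataset(config)
dataset.generate(providers)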
Prompt Expansion System¶
The prompt expansion system is central to datafast. It enables:
- Combinatorial Variation: Automatically creates multiple prompt variations through placeholder substitution
- Diversity Control: Provides mechanisms to balance between comprehensive coverage and targeted sampling
- Efficiency: Reduces manual prompt engineering effort while maximizing dataset diversity
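The mechanism in miniature, shown with plain Python rather than datafast internals:

from itertools import product

# One base prompt with two optional placeholders...
base = "Explain {topic} in a {{tone}} tone for {{audience}}."
options = {
    "tone": ["formal", "casual", "technical"],
    "audience": ["beginners", "experts"],
}

# ...expands combinatorially into 3 x 2 = 6 concrete prompts.
# (The mandatory {topic} placeholder is filled later by the dataset class.)
expanded = [
    base.replace("{{tone}}", tone).replace("{{audience}}", audience)
    for tone, audience in product(options["tone"], options["audience"])
]
assert len(expanded) == 6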
Conceptual Workflow¶
The datafast workflow follows a consistent pattern across all dataset types:
- Configuration: Define the dataset parameters, classes/topics, and generation settings
- Prompt Design: Create base prompts with mandatory and optional placeholders
- Provider Setup: Initialize one or more LLM providers
- Generation: Execute the generation process, which:
    - Expands prompts based on configuration
    - Distributes generation across providers
    - Collects and processes responses
- Output: Save the resulting dataset to a file and optionally push it to the Hugging Face Hub
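Put together, the five steps look roughly like this (a sketch: import paths, default prompts, and the push_to_hub call are assumptions; adapt to the actual API):

from datafast.datasets import ClassificationDataset        # assumed import paths
from datafast.llms import OpenAIProvider, AnthropicProvider
from datafast.schema.config import ClassificationDatasetConfig

# 1. Configuration
config = ClassificationDatasetConfig(
    classes=[
        {"name": "positive", "description": "Text expressing positive emotions"},
        {"name": "negative", "description": "Text expressing negative emotions"},
    ],
    num_samples_per_prompt=5,
    output_file="sentiment_dataset.jsonl",
    languages={"en": "English"},
)

# 2. Prompt design: default prompts are used here; custom prompts are
#    covered in the Customization section below.

# 3. Provider setup
providers = [OpenAIProvider(), AnthropicProvider()]

# 4. Generation: expands prompts, distributes calls, collects responses
dataset = ClassificationDataset(config)
dataset.generate(providers)

# 5. Output: rows are written to output_file; optionally push to the Hub
# dataset.push_to_hub("your-username/sentiment-dataset")  # hypothetical call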
Dataset Diversity Mechanisms¶
Datafast employs multiple strategies to maximize dataset diversity:
Multi-Dimensional Variation¶
Each dataset type supports variation across multiple dimensions:
- Content Dimensions: Topics, document types, classes, personas, etc.
- Linguistic Dimensions: Multiple languages, writing styles, formality levels
- Structural Dimensions: Different formats, lengths, complexities
Implementation Patterns¶
Structured Data with Pydantic¶
Datafast uses Pydantic extensively for:
- Configuration Validation: Ensuring all required parameters are present and valid
- Response Parsing: Converting LLM text outputs into structured data objects
- Type Safety: Providing type hints throughout the codebase for better IDE support and error catching
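For example, configuration validation in the Pydantic style (a generic illustration, not datafast's actual config class):

from pydantic import BaseModel, ValidationError, field_validator

class MiniConfig(BaseModel):
    # Hypothetical mini-config for illustration only.
    num_samples_per_prompt: int
    output_file: str

    @field_validator("num_samples_per_prompt")
    @classmethod
    def must_be_positive(cls, value: int) -> int:
        if value <= 0:
            raise ValueError("num_samples_per_prompt must be positive")
        return value

try:
    MiniConfig(num_samples_per_prompt=0, output_file="out.jsonl")
except ValidationError as error:
    print(error)  # the bad parameter is caught before any LLM call is made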
Asynchronous Generation¶
Coming soon
Intended Use Cases¶
- Get initial evaluation text data instead of starting your LLM project blind.
- Increase diversity and coverage of another dataset by generating additional data.
- Quickly experiment with and test LLM-based application PoCs.
- Make your own datasets to fine-tune and evaluate language models for your application.
Design Considerations¶
When working with datafast, consider these fundamental design aspects:
Cost Efficiency¶
Synthetic data generation involves API costs. Datafast provides mechanisms to control costs:
- Sampling: Generate representative random subsets instead of exhaustive combinations that explode combinatorially
- Provider Tiers: Use less expensive models for initial testing, premium models for final datasets
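A back-of-the-envelope count shows why sampling matters. With 3 tones, 3 audiences, 2 languages, and 2 classes, combinatorial expansion already yields 3 × 3 × 2 × 2 = 36 prompts, and at 5 samples per prompt that is 180 examples per provider:

# Estimate the number of LLM calls before generating (plain Python).
tones, audiences, languages, classes = 3, 3, 2, 2
num_samples_per_prompt = 5

prompts = tones * audiences * languages * classes         # 36 prompts
examples_per_provider = prompts * num_samples_per_prompt  # 180 examples
print(prompts, examples_per_provider)

If that exceeds your budget, switch the expansion to random sampling (see Expansion Configuration below) and cap the number of combinations drawn.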
Customization¶
Datafast is designed to be flexible and customizable to meet specific requirements:
Custom Prompts¶
Each dataset type supports custom prompts that allow you to tailor the generation process:
from datafast.schema.config import RawDatasetConfig  # import path may vary by version

config = RawDatasetConfig(
    # Basic configuration
    document_types=["research paper", "blog post"],
    topics=["AI", "climate change"],
    num_samples_per_prompt=2,
    output_file="output.jsonl",
    languages={"en": "English"},
    # Custom prompt with placeholders
    prompts=[
        "Generate {num_samples} {document_type} in {language_name} "
        "about {topic} with {{tone}} tone for {{audience}}."
    ],
)
Placeholder Systems¶
Datafast uses two types of placeholders for flexible prompt design:
- Mandatory Placeholders ({like_this}): Required by the dataset type and filled in automatically
- Optional Placeholders ({{like_this}}): User-defined, used to create prompt variations
Expansion Configuration¶
Control how your prompts are expanded with the PromptExpansionConfig:
from datafast.schema.config import PromptExpansionConfig  # import path may vary by version

expansion_config = PromptExpansionConfig(
    placeholders={
        "tone": ["formal", "casual", "technical"],
        "audience": ["beginners", "experts", "general public"],
    },
    combinatorial=True,  # Generate all combinations
    # OR use random sampling instead:
    # combinatorial=False,
    # num_random_samples=100,  # Number of random combinations to draw
)
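The expansion config is then attached to the dataset config, typically paired with a custom prompt containing the matching optional placeholders (the expansion field name is an assumption; check the API reference):

config = RawDatasetConfig(
    document_types=["blog post"],
    topics=["AI"],
    num_samples_per_prompt=2,
    output_file="output.jsonl",
    languages={"en": "English"},
    prompts=[
        "Generate {num_samples} {document_type} in {language_name} "
        "about {topic} with {{tone}} tone for {{audience}}."
    ],
    expansion=expansion_config,  # hypothetical field name
)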
Multi-Provider Strategy¶
Use different LLM providers for different aspects of dataset generation:
# Example for preference dataset generation
from datafast.llms import OpenAIProvider, AnthropicProvider, GeminiProvider  # import path may vary

dataset.generate(
    question_gen_llm=OpenAIProvider(),
    chosen_response_gen_llm=AnthropicProvider(),
    rejected_response_gen_llm=GeminiProvider(),
)
This approach allows you to leverage the strengths of different LLMs for specific generation tasks.