Prompt Expansion¶
Prompt expansion is a powerful technique in datafast to generate diverse outputs by creating multiple versions of prompts. This page explains how to use prompt expansion effectively across different dataset types.
What is Prompt Expansion?¶
Prompt expansion means creating multiple variants of your base prompts template by substituting placeholders with different values. Instead of writing dozens of similar prompts manually, you can define a template with placeholders and automatically expand it into many unique prompts.
For example, a single template like:
Could expand to prompts like: - "Explain how to use Python for web development" - "Explain how to use Python for data analysis" - "Explain how to use JavaScript for web development" - "Explain how to use JavaScript for data analysis"
With two values for programming language and two values for application type, we get four different prompts!
This dramatically increases the diversity of your generated data with minimal configuration.
Key Concepts¶
Placeholder Types¶
Datafast supports two types of placeholders:
-
Mandatory Placeholders (
{mandatory_placeholder}
): These are required placeholders specific to each dataset type and are filled in automatically by datafast based on your configuration. You must include them if you choose to use custom prompt template. -
Optional Placeholders (
{{optional_placeholder}}
): These are user-defined placeholders that expand into multiple variants based on your expansion configuration.
Note
You don't need to create a PromptExpansionConfig
for mandatory placeholders. However, you need to create one for optional placeholders.
Expansion Modes¶
Datafast offers two modes for expanding prompts:
-
Combinatorial Mode (default): Generates all possible combinations of placeholder values, resulting in comprehensive coverage of the prompt space.
-
Random Sampling Mode: Creates a specified number of random variations, useful when the total number of possible combinations would be too large.
How to Configure Prompt Expansion¶
Prompt expansion is configured using the PromptExpansionConfig
class, which is included in your dataset configuration.
from datafast.schema.config import PromptExpansionConfig
expansion_config = PromptExpansionConfig(
# Dictionary mapping placeholder names to lists of possible values
placeholders={
"programming_language": ["python", "javascript", "C++"],
"application_type": ["web development", "data analysis", "machine learning"]
},
# Whether to generate all combinations (True) or random samples (False)
combinatorial=True,
# Number of random samples to generate if combinatorial=False
num_random_samples=100,
# Safety limit to prevent generating too many combinations
max_samples=1000
)
Mandatory Placeholders by Dataset Type¶
Each dataset type in datafast requires specific mandatory placeholders to be included in the dataset prompt templates:
RawDataset¶
{num_samples}
: Number of examples to generate per prompt{language_name}
: Language to generate content in{document_type}
: Type of document to generate{topic}
: Topic to generate content about
You don't need to create a prompt expansion config for mandatory placeholders since datfast
will automatically expand them based on your dataset configuration. However, the mandatory placeholders should be included in any custom prompt templates you define.
Info
For instance, our default prompt template for the RawDataset
is:
MCQDataset¶
{num_samples}
: Number of questions to generate per document{language_name}
: Language to generate questions in{document}
: Source text to generate questions from
UltraChatDataset¶
{num_samples}
: Number of conversations to generate for that topic{language_name}
: Language to generate conversations in{topic}
: Topic for the generated conversations
TextClassificationDataset¶
{num_samples}
: Number of examples to generate per prompt{language_name}
: Language to generate content in{label}
: Label for classification{label_description}
: Description of the label
Usage Examples¶
Basic Example with RawDataset¶
from datafast.datasets import RawDataset
from datafast.schema.config import RawDatasetConfig, PromptExpansionConfig
config = RawDatasetConfig(
# Basic configuration
document_types=["research paper", "blog post"],
topics=["machine learning", "data science"],
num_samples_per_prompt=1,
output_file="output.jsonl",
languages={"en": "English"},
# Custom prompt with both mandatory and optional placeholders
prompts=[
"Generate {num_samples} {document_type} in {language_name} "
"about {topic} with {{tone}} tone and targeted at {{audience}}."
],
# Prompt expansion configuration
expansion=PromptExpansionConfig(
placeholders={
"tone": ["formal", "casual", "technical"],
"audience": ["beginners", "experts", "general public"]
},
combinatorial=True
)
)
This configuration would generate 36 different prompt variations (2 document types × 2 topics × 3 tones × 3 audiences) which would then be used to generate a dataset for 36 rows.
Random Sampling Example with MCQDataset¶
When the number of potential combinations is very large, use random sampling instead:
expansion_config = PromptExpansionConfig(
placeholders={
"difficulty": ["elementary", "high-school", "undergraduate", "graduate", "expert"],
"question_type": ["factual", "conceptual", "analytical", "applied"],
"context_specificity": ["general", "specific"],
"subject_area": ["history", "science", "literature", "mathematics", "art"]
},
combinatorial=False, # Use random sampling
num_random_samples=100 # Generate 100 random combinations
)
This will not generate all possible combinations, but only 100 random combinations. This is useful as the number of possible combinations can explode rapidly.
How Many Outputs Will Be Generated?¶
The number of outputs depends on:
- Base Dataset Configuration: Number of iterations from the dataset-specific parameters
- Number of LLM Providers: Each provider will generate its own set of outputs
- Prompt Expansion: Multiplies outputs based on combinatorial expansion or random sampling
For example, with RawDataset
:
Total outputs = document_types × topics × languages × num_samples_per_prompt × num_providers × expansion_combinations
For MCQDataset
:
Total questions = source_documents × num_samples_per_prompt × languages × num_providers × expansion_combinations
Performance Considerations¶
Keep these tips in mind when using prompt expansion:
- API Costs: More expansions mean more API calls to LLM providers
- Safety Limits: The
max_samples
parameter prevents accidental generation of too many prompts. By default, it is set to 1000. - Prefer Random Sampling: For very large expansion spaces, use random sampling instead of combinatorial expansion
Under the Hood: The expand_prompts
Function¶
Datafast implements prompt expansion using the expand_prompts
function. This function is used internally by dataset classes, but you can also use it directly for custom applications:
from datafast.expanders import expand_prompts
templates = ["Tell me about {{animal}} in {{environment}}"]
placeholders = {
"animal": ["lions", "dolphins", "eagles"],
"environment": ["desert", "ocean", "mountain"]
}
# Combinatorial expansion (all combinations)
expanded = expand_prompts(
prompt_templates=templates,
placeholders=placeholders,
combinatorial=True
)
# Random sampling (5 random combinations)
expanded_random = expand_prompts(
prompt_templates=templates,
placeholders=placeholders,
combinatorial=False,
num_random_samples=5
)
Each expansion returns both the expanded prompt and metadata about which values were used for each placeholder.
Best Practices¶
- Start Small: Begin with a small number of placeholders and values to understand the expansion
- Monitor Expansion Size: Check the expected number of outputs before running the full generation
- Use Meaningful Variations: Choose placeholder values that create meaningfully different prompts
- Balance Coverage vs. Volume: Decide whether complete coverage (combinatorial) or sampling is more appropriate
- Use Multiple Providers: Combine prompt expansion with multiple LLM providers for maximum diversity
Common Pitfalls¶
- Too Many Combinations: Avoid creating too many combinations which can lead to excessive API costs
- Missing Mandatory Placeholders: Ensure all required placeholders for your dataset type are included
- Inconsistent Placeholder Format: Use single curly braces for mandatory placeholders and double for optional ones
- Unclear Expansions: Having too many placeholders can make it hard to interpret the results
- Conflicting Placeholders: Be careful with prompt expansion as some combinations may lead to nonsense!