Datapipes

Learn how to use LLMs and create datasets
with simple and reproducible notebooks.

Note: All notebooks were last checked and updated on June 12, 2025.

A convenient way to create datasets

📊 Quickstart: Text Classification

Learn how to create text classification datasets with Datafast.

🔗 Datafast GitHub Repository

Visit the Datafast GitHub repository for more information and resources.

Work in progress, more guides coming...

🚀 How to use an OpenAI Chat model

Quick start guide to implementing OpenAI Chat models in your project.

🧠 How to use an Anthropic Claude model

A brief intro to using Anthropic’s Claude for chat or text completions.

🤖 How to use a Google Gemini model

Learn how to integrate Google’s Gemini into your workflow.

Work in progress, more guides coming...

📝 Structured Output with OpenAI

Generate structured JSON or XML responses using OpenAI’s chat models.

📝 Structured Output with Anthropic

Harness Anthropic’s Claude for structured data generation.

📝 Structured Output with Google Gemini

Leverage Google’s Gemini for consistent, structured outputs.

Work in progress, more guides coming...

🔧 Simple Question Generation with Distilabel and OpenAI

Create a quick question-generating pipeline using Distilabel + OpenAI.

🔧 Getting Started with Genstruct7B

Build your own text generation pipeline with the Genstruct7B model.

🔧 Create a self-instruct pipeline using Distilabel + OpenAI

Build a self-instruct pipeline that generates instruction data using Distilabel and OpenAI.

🔧 Create a Text Classification Dataset (fluff detection) and publish it

We use the free Gemini API, an existing dataset of people, and dynamic prompts to generate a diverse dataset and publish it to your Hugging Face Hub.

📚 Everything you need to know to work with the 🤗 Datasets library

The most relevant pieces of the Hugging Face Datasets library documentation, with a focus on text data handling and processing.

Work in progress, more guides coming...

📊 Evaluation 101

Intro to basic metrics for evaluating language model outputs.

Learn how to create and filter datasets for your specific needs

🎯 Semantically Filter Existing Datasets

Learn how to filter existing datasets to kickstart domain-specific projects.