Dria: Decentralized AI Network for Advanced Synthetic Data Generation by FirstBatch

A culture-driven, democratic, and human-oriented AI is not the direction we’re currently heading. However, decentralization provides tools that can lay the foundation for progressing in the right direction.

Dria taps the potential of a distributed network of LLMs, large and small, by building on a digital economy that values every parcel of compute available in idle devices scattered around the globe: a globe that is vast and rich in energy, compute, and diversity.

Value

Dria’s primary objective is to leverage decentralization to obtain state-of-the-art synthesized data. As data fuels every new iteration of LLMs (from smaller task-driven models to large models capable of complex reasoning, from case-specific RAG applications to multilingual models), synthetic data is an entire ecosystem. This ecosystem requires diverse pipelines and the orchestration of different models, particularly where privacy is a concern.

What does leveraging decentralization mean?

Creating value for AI that can only be obtained through decentralization.

Unbounded scrape limits
Hyper parallelization of inference
Ability to include human diversity
Creating economic value, incentivization
Ensembling outputs from diverse models

Although not strictly tied to decentralization, diversity in LLMs is a crucial aspect of Dria.

Diversity in LLMs, which inherit different preferences ( https://arxiv.org/abs/2406.04692 , https://arxiv.org/abs/2407.01490 ), can significantly contribute to the quality of synthetic data, boosting creativity and increasing the generation of low-probability tokens.

Hyper parallelized

Dria’s core value lies in its ability to run hundreds of small and medium LLMs capable of function calling and of executing long, complex tasks by breaking them down through Ollama Workflows.

Ollama Workflows is a framework for programming LLMs through JSON-based workflows, enabling them to break down complex tasks into smaller sub-tasks using tools and program memory. It addresses challenges with agentic frameworks and local LLMs, offering clear workflows and prompts for better task execution on consumer hardware, especially with smaller LLMs and limited resources.

The design of Ollama Workflows is heavily inspired by the LLM OS created by Andrej Karpathy, enabling users to execute any task by providing task-specific programs. Ollama Workflows does not require chat history and allows attaching specific prompts for each subtask along with I/O from memory.
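
To make the idea concrete, here is a minimal sketch of a JSON-driven workflow runner with program memory. The schema (`steps`, `prompt`, `inputs`, `output`), the `run_workflow` function, and the `stub_llm` stand-in are all hypothetical illustrations, not the actual Ollama Workflows format or API. The sketch only shows the two properties described above: each subtask gets its own prompt, and data moves between subtasks through named memory slots rather than chat history.

```python
import json

# Hypothetical workflow schema, for illustration only; this is not the
# actual Ollama Workflows format.
WORKFLOW_JSON = """
{
  "steps": [
    {"prompt": "List three subtopics of: {topic}",
     "inputs": ["topic"], "output": "subtopics"},
    {"prompt": "Write one question about: {subtopics}",
     "inputs": ["subtopics"], "output": "question"}
  ]
}
"""


def run_workflow(workflow, memory, llm):
    """Run each step with its own prompt, reading and writing named memory slots."""
    for step in workflow["steps"]:
        # Only the declared inputs are passed in; no chat history is kept.
        values = {key: memory[key] for key in step["inputs"]}
        prompt = step["prompt"].format(**values)
        memory[step["output"]] = llm(prompt)
    return memory


def stub_llm(prompt):
    """Stand-in for a local model call; it just echoes the prompt."""
    return f"<answer to: {prompt}>"


workflow = json.loads(WORKFLOW_JSON)
memory = run_workflow(workflow, {"topic": "synthetic data"}, stub_llm)
print(memory["question"])
```

Swapping `stub_llm` for a real local model call would leave the runner unchanged, which is the point of keeping prompts and memory in the workflow definition.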

Scrape limits

The distribution of agents removes limiting factors such as internet access and rate limits, enabling massive web scraping, data extraction, formatting, and reasoning on collected data. Dria builds its pipelines by augmenting data processes, grounding synthetic data with accurate research autonomously conducted by the Dria network, or the Dria Knowledge Network.

Synthetic data generation in the Dria network is grounded in real-world information. Nodes locate and extract relevant, accurate data to guide the generation process. This grounding involves finding and scraping appropriate resources in the correct format, then utilizing them flexibly without contextual constraints.

A centralized system can solve this problem only with difficulty up to a certain scale; beyond that scale, it becomes nearly impossible.

Dria nodes excel at executing complex workflows, combining multiple tasks in a single operation. For example, one node can research a topic online, store content in its vector database, search vast datasets for relevant context, create research notes, and generate questions based on given context. However, running an entire workflow on a single node is inefficient in terms of parallelization. Spreading the work across DKN nodes offers similar capabilities while potentially allowing for better distributed processing.

Human diversity

Recent studies indicate that there is considerable hidden value in persona-driven methodologies for synthetic data creation. By harnessing diverse perspectives, this approach generates high-quality, varied data. These personas serve as distributed knowledge carriers, enabling LLMs to access a wide range of contexts. This facilitates the creation of tailored, diverse synthetic data for various applications, including complex mathematical problems, logical reasoning tasks, user instructions, knowledge-rich texts, and game NPCs. Persona-driven synthesis offers versatility, scalability, and flexibility, potentially transforming LLM research and development. ( https://arxiv.org/html/2406.20094v1 , https://arxiv.org/html/2312.10007v1 )

Dria Network can incorporate human diversity in synthetic data generation. Two main approaches are:

Human-in-the-loop:
- Node runners provide feedback on generated samples, guiding the process.
- Challenges include expert knowledge requirements. These are mitigated, though not fully resolved, by: a) providing context for feedback, b) using consensus/validation mechanisms to minimize human error, and c) assigning tasks to multiple node runners.
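
As an illustration of the consensus idea, the sketch below accepts a piece of feedback only when enough node runners agree on it. The `consensus` function and its `min_agreement` threshold are assumptions made for this example, not Dria's actual validation mechanism.

```python
from collections import Counter


def consensus(labels, min_agreement=0.6):
    """Return the majority label if enough reviewers agree, else None."""
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes / len(labels) >= min_agreement else None


# Three node runners review the same sample: two of three agree.
majority = consensus(["accept", "accept", "reject"])
# Two node runners split evenly: no label reaches the threshold.
split = consensus(["accept", "reject"])
print(majority, split)
```

Assigning the same sample to several node runners and keeping only the feedback that clears the threshold limits the impact of any single reviewer's mistake.
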
Persona Integration:

Dria's secondary research focuses on incorporating node runners' personas.

This research uses long-term dialogues between the local model and the node runner to model the node runner's unique persona. Dria’s objective is to bring real-world human personas into the system, enhancing data generation with authentic human characteristics and diversity.

Abilities

Dria employs a swarm of LLMs to synthesize data using search. Custom workflows enable each node to focus on specific subtasks. Nodes conduct targeted research, identifying and extracting relevant data for their assigned subtask, then synthesize findings into a custom format.

Nodes can search the web or siloed data through APIs, efficiently locating, evaluating, and extracting the most pertinent information from any website or database.

Dria facilitates pipeline implementation via Admin Nodes. These pipelines can leverage the swarm for diverse custom applications.

Example pipeline:

Task: Generate JSON-formatted book reviews focusing on character development in fiction.

Identify fiction book list
Locate relevant forums and discussions
Validate sources
Filter content addressing character development
Verify filtered information
Format data in JSON
Aggregate results

This streamlined process collects, validates, and structures targeted book review data by running the steps on Dria nodes with hyper parallelization.
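
The filter, format, and aggregate steps of this pipeline can be sketched as a simple fan-out over workers. This is a local, single-machine approximation that uses a thread pool in place of network nodes; `POSTS`, `process_post`, and `run_pipeline` are hypothetical names, and the keyword filter stands in for the LLM-based filtering and validation a real node would perform.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for scraped forum posts; real nodes would fetch these.
POSTS = [
    {"book": "Emma", "text": "The character development of Emma is gradual."},
    {"book": "Emma", "text": "The cover art is lovely."},
    {"book": "Dune", "text": "Paul's character development drives the plot."},
]


def process_post(post):
    """One worker's subtask: filter a post, then format it as a review record."""
    if "character development" not in post["text"].lower():
        return None  # filtered out: not about character development
    return {"book": post["book"], "review": post["text"]}


def run_pipeline(posts, workers=4):
    """Fan posts out to workers in parallel, then aggregate the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_post, posts))
    return [record for record in results if record is not None]


reviews = run_pipeline(POSTS)
print(json.dumps(reviews, indent=2))
```

Because each post is processed independently, the same structure scales by adding workers, which is the property the network exploits at a much larger scale.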

Even though custom abilities are boundless, Dria will provide out-of-the-box use cases:

Q&A: Automated RAG & Fine tuning

Retrieval-augmented generation (RAG) is essential for developing tailored AI applications with domain-specific knowledge integration.

Dria's network enables users to input custom documents, datasets, and podcasts for Q&A data generation. By simulating a user base, Dria creates question-answer pairs from the provided content, laying the groundwork for autonomous RAG. This process addresses the critical initial step of data generation.

For fine-tuning data generation, Dria extends this approach to dialogues. Leveraging the persona integration capabilities outlined earlier, nodes can generate diverse, multi-turn conversations grounded in the provided custom content. These dialogues simulate real-world interactions between users with varied backgrounds (represented by personas) and an AI assistant knowledgeable about the input material.
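
One way to picture this is as a grid of generation tasks, one per persona and content chunk. The sketch below only builds the prompts; the personas, the chunk texts, the prompt wording, and the `build_tasks` helper are all hypothetical, and the actual model call is left out.

```python
from itertools import product

# Hypothetical personas and content chunks, for illustration only.
PERSONAS = ["a first-year nursing student", "a retired civil engineer"]
CHUNKS = [
    "Hand hygiene reduces hospital-acquired infections.",
    "Gloves do not replace hand washing.",
]


def build_tasks(personas, chunks):
    """Create one generation task per (persona, chunk) pair."""
    return [
        {
            "persona": persona,
            "context": chunk,
            "prompt": (
                f"You are {persona}. Using only the context below, ask one "
                f"question and then answer it.\n\nContext: {chunk}"
            ),
        }
        for persona, chunk in product(personas, chunks)
    ]


tasks = build_tasks(PERSONAS, CHUNKS)
print(len(tasks), "tasks")
print(tasks[0]["prompt"])
```

Each task is independent of the others, so tasks like these could be handed to different nodes without coordination.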

The network's web search functionality further enhances the grounding of synthesized data. Nodes can augment the custom documents with relevant, up-to-date information from the internet, ensuring the generated dialogues reflect a broader context. This combination of persona-driven dialogue generation and web-searched context results in rich, diverse, and factually grounded datasets ideal for fine-tuning domain-specific models.

Multi-language

Synthesizing multilingual data poses challenges due to base models' inferior performance in generating non-English text compared to English. This imbalance impacts the quality and accuracy of multilingual natural language processing tasks.

Leveraging multi-language web data while minimizing generation significantly mitigates semantic and lexical issues in multilingual data synthesis. Dria agents excel at conducting multi-language research, enabling effective synthesis of multilingual data with minimal generation.

Truthfulness Validation

LLMs hallucinate, potentially producing unreliable outputs even when augmented with retrieval-based techniques and preference optimization.

Dria Knowledge Network (DKN) serves as a tool for validating the truthfulness of information generated by LLMs. By leveraging the swarm, Dria can systematically verify claims and statements against reliable sources.

It cross-references external sources and databases to validate information, ensuring thorough fact-checking and verification of AI-generated content. Custom validation is possible by providing Dria with specific sources of truth.
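
A heavily simplified sketch of cross-referencing against user-provided sources of truth follows. It assumes claims have already been reduced to `(subject, value)` pairs and that sources are plain lookup tables; a real validator would need retrieval and model-based judgment instead of exact matching. `SOURCES` and `validate_claim` are hypothetical names.

```python
# Hypothetical sources of truth, keyed by source name.
SOURCES = {
    "encyclopedia": {"water_boiling_point_c": 100},
    "handbook": {"water_boiling_point_c": 100},
}


def validate_claim(claim, sources):
    """Cross-reference a (subject, value) claim against several sources."""
    subject, value = claim
    supporting = [
        name for name, facts in sources.items() if facts.get(subject) == value
    ]
    conflicting = [
        name for name, facts in sources.items()
        if subject in facts and facts[subject] != value
    ]
    if supporting and not conflicting:
        verdict = "supported"
    elif conflicting and not supporting:
        verdict = "contradicted"
    elif supporting and conflicting:
        verdict = "disputed"
    else:
        verdict = "unverified"  # no source mentions the subject
    return {"verdict": verdict, "supporting": supporting, "conflicting": conflicting}


good = validate_claim(("water_boiling_point_c", 100), SOURCES)
bad = validate_claim(("water_boiling_point_c", 90), SOURCES)
unknown = validate_claim(("gold_melting_point_c", 1064), SOURCES)
print(good["verdict"], bad["verdict"], unknown["verdict"])
```

Keeping the list of supporting and conflicting sources alongside the verdict lets a reviewer see why a claim was accepted or rejected.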

Organizations and media outlets require robust fact-checking mechanisms for AI-generated content to maintain credibility and combat misinformation. Dria's truthfulness validation capability provides a scalable, programmable solution to meet this critical need.

Finding the Right Model

Dria's capabilities focus on selecting the optimal open or closed-source model for specific tasks. Large Language Models (LLMs) are versatile, applicable to various functions such as classification, chat, tool calling, and summarization. While model performance often correlates with parameter size, other factors like training data and pre/post-training processes can significantly influence effectiveness, even among models of equal size. In some instances, smaller models may outperform larger ones for particular tasks.

Dria's process involves:

Generating small-scale synthetic data for the target task
Evaluating baseline performance of various Hugging Face, OpenAI, and Anthropic models within a selected size range
Utilizing Reflexion-like feedback to refine sample accuracy ( https://arxiv.org/abs/2303.11366 )
Optimizing prompts with DSPy for powerful in-context learning ( https://arxiv.org/abs/2310.03714 )
Re-evaluating model performance using enhanced prompts

The final output includes comprehensive evaluations and optimized prompts.
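
The evaluate and re-evaluate steps can be sketched as a small grid search over models and prompts. The two toy classifiers below stand in for real hosted or local models, and the sketch does not implement Reflexion or DSPy; it only shows how a baseline prompt and an improved prompt would be scored on the same synthetic samples. All names here are hypothetical.

```python
def accuracy(model, prompt, samples):
    """Fraction of samples where the model's answer matches the expected label."""
    hits = sum(model(prompt, text) == label for text, label in samples)
    return hits / len(samples)


def select_model(models, prompts, samples):
    """Score every (model, prompt) pair and return the best (score, name, prompt)."""
    scored = [
        (accuracy(model, prompt, samples), name, prompt)
        for name, model in models.items()
        for prompt in prompts
    ]
    return max(scored)


def keyword_model(prompt, text):
    """Toy classifier that ignores the prompt and only looks for 'great'."""
    return "positive" if "great" in text else "negative"


def prompt_sensitive_model(prompt, text):
    """Toy classifier that only works when the prompt mentions sentiment."""
    if "sentiment" not in prompt:
        return "negative"
    return "positive" if ("great" in text or "love" in text) else "negative"


# Small synthetic evaluation set for the target task.
SAMPLES = [
    ("great book", "positive"),
    ("I love it", "positive"),
    ("boring", "negative"),
    ("bad plot", "negative"),
]
MODELS = {"keyword": keyword_model, "prompt_sensitive": prompt_sensitive_model}
PROMPTS = ["Classify:", "Classify the sentiment:"]

best = select_model(MODELS, PROMPTS, SAMPLES)
print(best)
```

In this toy setup the model that looks weaker under the baseline prompt wins once the prompt is improved, which is why the process re-evaluates after prompt optimization rather than ranking models once.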

Guardrails

Dria Knowledge Network (DKN) acts as a firewall for LLMs, filtering and restructuring generations by large models based on specific policies. Dria nodes can verify whether a given output complies with a certain policy, and can even reformat the output with small changes so that it fits the policy.

Dria enables guardrailing outputs at massive scale, handling large batch sizes. It also leverages external documents to validate output compliance.
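
A minimal sketch of such a firewall over a batch of outputs is shown below. The policy here (redact email addresses, reject banned terms, enforce a length limit) and the `apply_guardrail` function are assumptions chosen for illustration; a real policy check would likely rely on a model rather than on regular expressions and keyword lists.

```python
import re

# Simple email pattern, good enough for this illustration.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def apply_guardrail(text, banned_terms, max_chars):
    """Return (compliant, text): redact emails, then check the remaining rules."""
    # Small repair that keeps the output usable instead of rejecting it.
    fixed = EMAIL.sub("[redacted]", text)
    violations = [term for term in banned_terms if term.lower() in fixed.lower()]
    if len(fixed) > max_chars:
        violations.append("too long")
    return (not violations, fixed)


outputs = [
    "Contact me at a.b@example.com today.",
    "We guarantee returns.",
]
results = [
    apply_guardrail(text, banned_terms=["guarantee"], max_chars=200)
    for text in outputs
]
for compliant, text in results:
    print(compliant, text)
```

The first output is repaired and passed through, while the second is flagged because no small edit would make it compliant.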

Companies and organizations utilizing LLMs demand 100% policy compliance for all outputs. This necessitates a secondary, easily programmable firewall system.

We will be publishing more of our research in the coming weeks so follow us on X and join our Discord to stay updated.