Diversity, in both synthetic and human-written data, is a large and persistent problem in AI training. For language models in particular, its importance is underscored by the finding that small mixtures of diverse, high-quality data outperform large mixtures, as highlighted in the LIMA, MoDS and DavIR papers. Yet diversity, as important as it is, is also very challenging to promote when generating synthetic data with large language models.
GenQA states that most language models "suffer from a lack of randomness": when asked to generate instruction-tuning data, they produce low-diversity datasets with many duplicate samples. This makes sense, since the LLM's entire objective is to model, as faithfully as it can, the distribution of its training data, which is a static distribution. Sampling parameters like `top_p`, `min_p` and temperature are useful for introducing diversity, but they cannot push generations beyond the model's distribution without sacrificing quality.
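As a minimal illustration of this point (the helper name and default values here are ours), temperature and `top_p` only rescale or truncate the distribution the model already produces; they never add probability mass to tokens the model considers unlikely:

```python
import torch

def sample_with_temperature_top_p(logits: torch.Tensor,
                                  temperature: float = 1.0,
                                  top_p: float = 0.9) -> int:
    """Sample one token id from a 1-D logits vector."""
    # Temperature rescales the logits: it flattens or sharpens the model's
    # distribution but cannot introduce tokens outside of it.
    probs = torch.softmax(logits / temperature, dim=-1)

    # top_p (nucleus) sampling truncates to the smallest set of tokens
    # whose cumulative probability exceeds top_p.
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p  # always keeps the top token
    sorted_probs[~keep] = 0.0
    sorted_probs /= sorted_probs.sum()

    return sorted_ids[torch.multinomial(sorted_probs, num_samples=1)].item()
```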
So when generating large-scale synthetic datasets, we need to ensure that the data is both diverse and of high quality. To that end, we started looking for methods that can do this in a decentralised way. The methods we found either did not get the "decentralised" part right, or needed static, pre-selected curricula to introduce diversity. In our research, we came across the Classifier-Free Guidance (CFG) method and its application to language models. In that application, the generation of a language model can be steered towards or away from a specific concept, indicated by positive or negative prompts, using a scalar called the guidance scale.
Inspired by this, we formulated a method which we named Guided Generation. We extended CFG by incorporating multiple negative prompts: we compute logits for each negative prompt separately, then aggregate them using an element-wise minimum. The overall formula for the logits is:

$$\tilde{\ell} = \ell^{-} + \gamma \left( \ell^{+} - \ell^{-} \right)$$

where $\gamma$ is the guidance scale and $\ell^{+}$ and $\ell^{-}$ are:

$$\ell^{+} = \mathrm{logits}\left(x_t \mid c^{+}, x_{<t}\right), \qquad \ell^{-} = \min_{i} \, \mathrm{logits}\left(x_t \mid c^{-}_{i}, x_{<t}\right)$$

with $c^{+}$ the positive prompt, $c^{-}_{i}$ the negative prompts, and the minimum taken element-wise over the vocabulary.
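A minimal sketch of this logit computation (the function and argument names are ours, not a library API):

```python
import torch

def guided_logits(pos_logits: torch.Tensor,
                  neg_logits_list: list[torch.Tensor],
                  guidance_scale: float) -> torch.Tensor:
    """CFG with multiple negative prompts, aggregated by element-wise min.

    pos_logits:       next-token logits conditioned on the positive prompt
    neg_logits_list:  next-token logits, one tensor per negative prompt
    guidance_scale:   gamma in the formula above
    """
    # Aggregate the per-negative-prompt logits with an element-wise minimum.
    neg_logits = torch.stack(neg_logits_list, dim=0).min(dim=0).values

    # Standard CFG interpolation/extrapolation between negatives and positive.
    return neg_logits + guidance_scale * (pos_logits - neg_logits)
```

With `guidance_scale = 1` this reduces to the positive-prompt logits; values above 1 extrapolate away from the aggregated negatives, and negative values steer the generation towards them.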
Using this modified version of CFG with multiple negative prompts, we came up with a process to generate large-scale diverse synthetic data. We have nodes exposing the methods `generate()` (regular LLM sampling) and `generate_guided()` (guided generation), and a centralised controller with an integrated vector DB, a `select()` method that selects the top prompts from dense clusters in the vector DB, and a `similar()` method that finds the most similar prompts in the vector DB. The process is as follows:
1. Nodes augment the `base_instruction` with `generate()`, and share the augmentations with the controller.
2. The controller uses `select()` to pick the top prompts from dense clusters in the vector DB as negative prompts; nodes generate new augmentations with `generate_guided()`, and share the augmentations with the controller.
3. For each new augmentation, the controller uses `similar()` to check if it's similar to any of the existing augmentations. If it is, the guidance scale is increased and `generate_guided()` is used again, until no similar augmentations are found (or a predetermined `number_of_iterations` is reached). A sketch of this loop is shown below.
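Here is an illustrative sketch of the loop. The `node` and `controller` objects and the `controller.add()` ingestion method are hypothetical stand-ins for our implementation; only `generate()`, `generate_guided()`, `select()` and `similar()` correspond to the methods described above.

```python
def run_node(node, controller, base_instruction: str,
             number_of_iterations: int = 5,
             guidance_scale: float = 1.5) -> list[str]:
    # Step 1: augment the base instruction with regular sampling and
    # share the augmentations with the controller (and its vector DB).
    augmentations = node.generate(base_instruction)
    controller.add(augmentations)  # hypothetical ingestion method

    # Step 2: the controller selects top prompts from dense clusters in
    # the vector DB; these serve as negative prompts for guided generation.
    negatives = controller.select()
    candidate = node.generate_guided(base_instruction, negatives, guidance_scale)

    # Step 3: while the candidate is similar to existing augmentations,
    # increase the guidance scale and regenerate, up to a fixed budget.
    for _ in range(number_of_iterations):
        if not controller.similar(candidate):
            break
        guidance_scale += 0.5  # illustrative increment
        candidate = node.generate_guided(base_instruction, negatives, guidance_scale)

    controller.add([candidate])
    return augmentations + [candidate]
```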
We used the following metrics to compare our method to the baseline (a sketch of both computations follows the list):

- **Average dissimilarity:** `1 - S_avg`, where `S_avg` is the average cosine similarity across all unique pairs of sequence embeddings in a set.
- **MST cost:** the total cost of a minimum spanning tree over the embeddings, with edge weights `1 - cosine_similarity`. A higher MST cost suggests the embeddings are more dispersed in the semantic space, indicating greater diversity.
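Both metrics can be computed from sentence embeddings as in this sketch (the embedding model is whatever you use upstream, e.g. a sentence-transformers model; rows are assumed L2-normalised):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def diversity_metrics(embeddings: np.ndarray) -> tuple[float, float]:
    """Return (1 - S_avg, MST cost); higher means more diverse for both."""
    # Pairwise cosine similarities (rows are assumed unit-norm).
    sims = embeddings @ embeddings.T
    n = len(embeddings)

    # S_avg: mean cosine similarity over all unique pairs (upper triangle).
    iu = np.triu_indices(n, k=1)
    s_avg = float(sims[iu].mean())

    # MST cost over the complete graph with edge weights 1 - cosine_similarity.
    # A tiny floor keeps near-duplicate edges from being dropped, since
    # scipy treats exact zeros as missing edges.
    dists = np.clip(1.0 - sims, 1e-12, None)
    np.fill_diagonal(dists, 0.0)
    mst = minimum_spanning_tree(dists)
    return 1.0 - s_avg, float(mst.sum())
```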
The results above, which were obtained using `microsoft/Phi-4-mini-instruct`, show that our method improves diversity over the baseline without sacrificing quality.
To illustrate the effect of guided generation, we have some examples.
In this first example we can see the effect of the guidance scale: a negative value steers the generation towards the negative concepts, while increasing positive values steer the generation further and further away from them.
In this second example, we see that adding more and more negative prompts (concepts) to the generation process works without issue: each subsequent generation is semantically distant from the negative prompts and from the previous generations.
We also tested whether few-shot examples would improve the results. We found that while few-shot prompting improves the metrics slightly, it significantly reduces syntactic diversity, which is also an important measure of dataset quality, so we decided not to use it.
The method we presented, Guided Generation, is a decentralised method for generating large-scale, diverse synthetic data from a single base instruction. In both qualitative and quantitative analysis, we saw that it improved upon the baseline in diversity without sacrificing quality. We also tried sampling-level methods that compared the embedding similarity of candidate generations in advance, but with those we could not promote diversity without a loss in quality. That led us to this method, which is less invasive and does not try to perturb the logits directly during sampling. Future work could extend this method to the sampling level and see whether it improves upon the results we have obtained.
Overall, this method is a proven way to introduce diversity into synthetic data generation; it is decentralised, scales with the number of nodes, and is training-free. We are happy with the current results and with the future work that could build on this method.