Created at 4pm, Apr 30
Kerim-Kaya
Artificial Intelligence
Best Practices and Lessons Learned on Synthetic Data for Language Models
Contract ID
JPOR4ffroFTeR_JsVkaBG2os9HEtPQVfNg9UFLjlOXA
File Type
PDF
Entry Count
129
Embedding Model
jina_embeddings_v2_base_en
Index Type
hnsw

The success of AI models relies on the availability of large, diverse, and high-quality datasets, which can be challenging to obtain due to data scarcity, privacy concerns, and high costs. Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns. This paper provides an overview of synthetic data research, discussing its applications, challenges, and future directions. We present empirical evidence from prior art to demonstrate its effectiveness and highlight the importance of ensuring its factuality, fidelity, and unbiasedness. We emphasize the need for responsible use of synthetic data to build more powerful, inclusive, and trustworthy language models.

L. P. Argyle, E. C. Busby, N. Fulda, J. R. Gubler, C. Rytting, and D. Wingate. Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337–351, 2023. A. Asai, X. Yu, J. Kasai, and H. Hajishirzi. One question answering model for many languages with cross-lingual dense passage retrieval. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 7547–7560, 2021. URL 3df07fdae1ab273a967aaa1d355b8bb6-Abstract.html. A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma, et al. A general language assistant as a laboratory for alignment. ArXiv preprint, abs/2112.00861, 2021. URL
id: af58dce3f23dd0a3ce612254e4878a8e - page: 12
S. A. Assefa, D. Dervovic, M. Mahfouz, R. E. Tillman, P. Reddy, and M. Veloso. Generating synthetic data in finance: opportunities, challenges and pitfalls. In Proceedings of the First ACM International Conference on AI in Finance, pages 1–8, 2020. Z. Azerbayev, H. Schoelkopf, K. Paster, M. D. Santos, S. McAleer, A. Q. Jiang, J. Deng, S. Biderman, and S. Welleck. Llemma: An open language model for mathematics. ArXiv preprint, abs/2310.10631, 2023. URL R. Babbar and B. Schölkopf. Data scarcity, robustness and extreme multi-label classification. Machine Learning, 108(8):1329–1351, 2019. Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. ArXiv preprint, abs/2212.08073, 2022. URL E. Barbierato, M. L. D. Vedova, D. Tessera, D. Toti, and N. Vanoli. A methodology for controlling bias
id: a2d918e2a0b45a5f87488b228eef84ec - page: 12
Applied Sciences, 12(9):4619, 2022. W. Bi, H. Li, and J. Huang. Data augmentation for text generation without any augmented data. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2223–2237, Online, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.173. URL S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. van den Driessche, J. Lespiau, B. Damoc, A. Clark, D. de Las Casas, A. Guy, J. Menick, R. Ring, T. Hennigan, S. Huang,
id: 0e852b0e98111d1004440a4978cbc578 - page: 12
L. Maggiore, C. Jones, A. Cassirer, A. Brock, M. Paganini, G. Irving, O. Vinyals, S. Osindero, K. Simonyan, J. W. Rae, E. Elsen, and L. Sifre. Improving language models by retrieving from trillions of tokens. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvári, G. Niu, and S. Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 2206–2240. PMLR, 2022. URL V. Borisov, K. Seßler, T. Leemann, M. Pawelczyk, and G. Kasneci. Language models are realistic tabular data generators. ArXiv preprint, abs/2210.06280, 2022. URL 06280.
id: 553835e2e53fc002b0db7287656c2979 - page: 13
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "JPOR4ffroFTeR_JsVkaBG2os9HEtPQVfNg9UFLjlOXA", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "JPOR4ffroFTeR_JsVkaBG2os9HEtPQVfNg9UFLjlOXA", "level": 2}'