Created at 9am, Feb 2
benjaminBook
Efficient Exploration for LLMs, a Paper by Google & Stanford University
Contract ID: 3zfCNISgH8NiyfiMUm9U9b3I46F4nHkgrmxi_xgDGOA
File Type: PDF
Entry Count: 69
Embed. Model: jina_embeddings_v2_base_en
Index Type: hnsw

Abstract of the Paper: We present evidence of substantial benefit from efficient exploration in gathering human feedback to improve large language models. In our experiments, an agent sequentially generates queries while fitting a reward model to the feedback received. Our best-performing agent generates queries using double Thompson sampling, with uncertainty represented by an epistemic neural network. Our results demonstrate that efficient exploration enables high levels of performance with far fewer queries. Further, both uncertainty estimation and the choice of exploration scheme play critical roles.

Original Paper: https://arxiv.org/abs/2402.00396
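
The abstract outlines a simple sequential loop: the agent proposes a pair of responses, a human annotator indicates a preference, and the reward model is refit on the accumulated feedback. A rough sketch of that loop is below; every name and signature is illustrative, not an interface from the paper or its codebase.

from typing import Callable, List, Tuple
import random

def feedback_loop(
    prompts: List[str],
    reward_model,                                                 # epistemic reward model (e.g. an ENN)
    generate_responses: Callable[[str], List[str]],               # candidate responses from the LLM
    select_pair: Callable[[List[str], object], Tuple[str, str]],  # exploration scheme, e.g. double TS
    human_preference: Callable[[str, str, str], int],             # 0 or 1: which response was preferred
    fit_reward_model: Callable[[list], object],                   # refits the reward model on feedback
    num_epochs: int,
    queries_per_epoch: int,
):
    feedback = []
    for _ in range(num_epochs):
        for _ in range(queries_per_epoch):
            prompt = random.choice(prompts)
            candidates = generate_responses(prompt)
            a, b = select_pair(candidates, reward_model)          # choose the query pair to show the annotator
            feedback.append((prompt, a, b, human_preference(prompt, a, b)))
        reward_model = fit_reward_model(feedback)                 # learn from everything gathered so far
    return reward_model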

For the Boltzmann exploration scheme, we swept over several temperatures and found that small temperatures produced the best results. A similar level of performance was achieved by a variant of the Boltzmann scheme that selects one of the responses greedily and the second response using Boltzmann. More details can be found in Appendix C. In the case of infomax, we used 30 epistemic indices to compute means and variances. For the double TS agent, we set the maximum number of attempts at producing a distinct second response to 30. Appendix B presents further detail on our hyperparameter selection process.
id: c6fa296aecdc0f86871cc3d244c8bf55 - page: 9
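
As a rough illustration of the selection rules named in this entry (Boltzmann with a temperature, the greedy-first Boltzmann variant, and double TS with a cap on attempts at a distinct second response), the sketches below show one plausible reading. They are not the authors' implementation; reward and sample_reward_fn stand in for reward estimates from the (epistemic) reward model.

import math
import random
from typing import Callable, List, Tuple

def boltzmann_pair(candidates: List[str],
                   reward: Callable[[str], float],
                   temperature: float) -> Tuple[str, str]:
    # Sample both responses from a softmax over point-estimate rewards;
    # small temperatures concentrate mass on the highest-reward responses.
    weights = [math.exp(reward(c) / temperature) for c in candidates]
    return (random.choices(candidates, weights=weights)[0],
            random.choices(candidates, weights=weights)[0])

def greedy_boltzmann_pair(candidates: List[str],
                          reward: Callable[[str], float],
                          temperature: float) -> Tuple[str, str]:
    # Variant: first response chosen greedily, second via Boltzmann.
    weights = [math.exp(reward(c) / temperature) for c in candidates]
    return max(candidates, key=reward), random.choices(candidates, weights=weights)[0]

def double_ts_pair(candidates: List[str],
                   sample_reward_fn: Callable[[], Callable[[str], float]],
                   max_attempts: int = 30) -> Tuple[str, str]:
    # Double Thompson sampling: each response is greedy under an independently
    # sampled reward function (epistemic index); retry up to max_attempts times
    # to obtain a second response distinct from the first.
    first = max(candidates, key=sample_reward_fn())
    second = first
    for _ in range(max_attempts):
        second = max(candidates, key=sample_reward_fn())
        if second != first:
            break
    return first, second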
5.1. Assessment of Exploration Algorithms
Figure 5 plots win rates of each agent across different numbers of epochs of interaction. The results, obtained by averaging across 5 random seeds, clearly demonstrate that active exploration accelerates learning and results in higher win rates. Notably, the double TS agent emerges as the top performer. We observe that infomax performs very well over early epochs but later falls far short of double TS. This divergence may be due to infomax's inclination to seek information, irrespective of whether that information is helpful in generating desirable responses.
id: 818cc4f82927da3fbde85ca6c91b23bc - page: 9
Each of the performance curves in Figure 5 appears to converge, while one would hope for continued improvement as the volume of human interaction grows. Reward model capacity, which can be thought of loosely as the effective number of parameters learned from feedback, gates the degree of improvement. For any capacity, one would expect convergence as the number of queries grows. Increasing the capacity enables further improvement at the cost of increased computation. This relates to the notion explained by Arumugam & Van Roy (2021) that it is beneficial to moderate the complexity of a learning target based on the duration over which an agent expects to explore.
5.2. Scaling with the Volume of Feedback
id: 7d46fa8a61c91cc3694f6ff76f700083 - page: 9
Figure 1, reproduced from Section 1 for convenience, plots the number of queries required by alternatives to match the performance of double TS, which we found to be most efficient among the exploration algorithms we considered. While the plots are not conclusive, we discern that they are concave. Suppose we measure the advantage of efficient exploration in terms of the percentage reduction in data required to attain any given level of performance. Concavity of the plots in Figure 1 implies that, as the scale of human feedback data grows, so does the advantage afforded by efficient exploration. For the level of performance attained by 30,000 passive queries, double TS ...
id: 6abffdd77c7184f1dfa48558e1ea0d55 - page: 9
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "3zfCNISgH8NiyfiMUm9U9b3I46F4nHkgrmxi_xgDGOA", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "3zfCNISgH8NiyfiMUm9U9b3I46F4nHkgrmxi_xgDGOA", "level": 2}'