Created at 9am, Mar 5
Ms-RAG · Artificial Intelligence
Mitigating Reversal Curse via Semantic-aware Permutation Training
Contract ID
I8zB-ZyDNunbGb8J-wm0_g1eEf3KWxUkP0H0geKCRxs
File Type
PDF
Entry Count
70
Embed. Model
jina_embeddings_v2_base_en
Index Type
hnsw

Qingyan Guo, Rui Wang, Junliang Guo, Xu Tan, Jiang Bian, Yujiu Yang

While large language models (LLMs) have achieved impressive performance across diverse tasks, recent studies show that causal LLMs suffer from the "reversal curse". A typical example is that the model knows "A's father is B" but is unable to reason that "B's child is A". This limitation poses a challenge to the advancement of artificial general intelligence (AGI), as it suggests a gap in the models' ability to comprehend and apply bidirectional reasoning. In this paper, we first conduct substantial evaluation and identify that the root cause of the reversal curse lies in the different word order between the training and inference stages, namely, the poor ability of causal language models to predict antecedent words within the training data. Accordingly, permutation of the training data is considered a potential solution, since this can make the model predict antecedent words or tokens. However, previous permutation methods may disrupt complete phrases or entities, thereby posing challenges for the model to comprehend and learn from the training data. To address this issue, we propose Semantic-aware Permutation Training (SPT), which segments the training sentences into semantic units (i.e., entities or phrases) with an assistant language model and permutes these units before feeding them into the model. Extensive experiments demonstrate that SPT effectively mitigates the reversal curse, since the performance on reversed questions approximates that on the forward ones, and significantly advances the performance of existing works.
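The core procedure described in the abstract (segment each training sentence into semantic units with an assistant LM, then permute the units before feeding them to the model) can be sketched in a few lines of Python. This is a minimal illustration only; the function names, the segmentation prompt, and the fallback behaviour are assumptions, not the authors' released implementation.

# Minimal sketch of the semantic-aware permutation idea from the abstract.
# Function names and the segmentation prompt are illustrative assumptions.
import random

def segment_into_units(sentence: str, assistant_generate) -> list[str]:
    """Ask an assistant LM to split a sentence into semantic units
    (entities or phrases). `assistant_generate` is any callable mapping
    a prompt string to the model's text completion."""
    prompt = (
        "Split the following sentence into semantic units (entities or "
        "phrases), one per line:\n" + sentence
    )
    reply = assistant_generate(prompt)
    units = [line.strip() for line in reply.splitlines() if line.strip()]
    # Fall back to whitespace tokens if the assistant returns nothing usable.
    return units if units else sentence.split()

def permute_units(units: list[str]) -> str:
    """Randomly permute the semantic units and rejoin them, so the causal
    LM is also trained to predict tokens that originally appeared earlier."""
    shuffled = units[:]
    random.shuffle(shuffled)
    return " ".join(shuffled)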

to a large extent, while the performance on the forward questions does not drop significantly (compared with the models trained on standard data in Table 3). Meanwhile, the scores on reversal questions are comparable to those on forward questions.

5.1 Settings
We employ the open-source Vicuna-13b-v1.3 model (Chiang et al., 2023), fine-tuned from LLaMA, as the assistant for segmenting sentences, with the corresponding instructions shown in Figure 4. Then, we continue-train LLaMA-7B (Touvron et al., 2023) by semantic-aware permutation training (Eq. 1). See Appendix A.2 for more parameters.
id: ecfcf8bb2e1e25a9e2f140edb57f4108 - page: 6
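The chunk above cites "Eq. 1" without reproducing it. Given the method description (permute the semantic units, then train with the usual causal language-modeling objective), a plausible form is sketched below; the symbols u_i, sigma, and theta are introduced here purely for illustration and are not taken from the paper.

% Assumed form of the SPT objective (Eq. 1 is not shown in this index):
% a sentence s is segmented into semantic units (u_1, ..., u_k), \sigma is a
% (possibly identity) permutation of {1, ..., k}, and
% x = concat(u_{\sigma(1)}, ..., u_{\sigma(k)}) is the permuted sequence.
% The loss is the standard next-token prediction loss on x:
\mathcal{L}(\theta) = -\sum_{t=1}^{|x|} \log p_\theta\left(x_t \mid x_{<t}\right)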
Model  Q1     Q2     Q3     Q4     Q5     Q6     Q7     Q8
M1     97.75  97.82  94.86  94.35  95.77  95.51  94.98  94.91
M2     71.78  68.01  98.37  96.61  93.59  92.07  95.18  94.32
M3     90.09  89.82  84.82  78.17  89.29  84.6   90.88  92.13
M4     64.97  63.32  97.11  96.36  96.03  95.44  96.96  97.36
Table 6: Accuracy on questions Q1-Q8 for models M1-M4 trained by SPT with different data formats. SPT is trained on either the original sentence, or the reversed or permuted chunks after segmentation by the assistant model, each with a probability of 1/3. The reversed and permuted chunks are wrapped by the tags <reverse> and </reverse>, and <permute> and </permute>, respectively. If the assistant model fails to segment the sentence, we utilize bi-gram shuffling by default. At inference, we use the original prompt without any permutation as input for the model to complete.
id: 42d4d8409639a3613b9cff051497f4d0 - page: 6
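The data construction described in the Table 6 caption can be sketched roughly as below. The helper names are hypothetical; only the <reverse>/<permute> tags, the 1/3 probabilities, and the bi-gram fallback come from the text, and whether the fallback output is also wrapped in tags is not specified (it is left untagged here).

# Rough sketch of the training-example construction from the Table 6 caption.
import random

def bigram_shuffle(sentence: str) -> str:
    """Default fallback when the assistant fails to segment the sentence:
    split it into consecutive bi-grams and shuffle those."""
    tokens = sentence.split()
    bigrams = [tokens[i:i + 2] for i in range(0, len(tokens), 2)]
    random.shuffle(bigrams)
    return " ".join(token for pair in bigrams for token in pair)

def build_training_text(sentence: str, units: list[str] | None) -> str:
    """Pick one of three formats with probability 1/3 each, as in the
    Table 6 caption: original sentence, reversed chunks, or permuted chunks."""
    choice = random.choice(["original", "reverse", "permute"])
    if choice == "original":
        return sentence
    if not units:                        # segmentation failed -> bi-gram fallback
        return bigram_shuffle(sentence)
    if choice == "reverse":
        return "<reverse> " + " ".join(reversed(units)) + " </reverse>"
    shuffled = units[:]
    random.shuffle(shuffled)
    return "<permute> " + " ".join(shuffled) + " </permute>"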
5.2 Results
We use three datasets proposed by Berglund et al. (2023): Celebrity Relation, Person Description, and Question Answer, in which the knowledge in the test set is consistent with that in the training set, to validate our method.

Celebrity Relation
We use the same data formats as in Section 3. Then, we segment the sentences into semantic-aware chunks in D1-D4 (Table 1) and train the corresponding models with the same hyper-parameters, denoted as M1-M4. The results are reported in Table 6. We can see that SPT effectively mitigates the reversal curse
id: 02581dcf421435a7db87fde28e0852d9 - page: 6
Person Description
This dataset is generated by GPT-4. Composed of three subsets (D1, D2 and D3), the training set includes 3,600 sentences in the form <person> is <description> (pi-di) or <description> is <person> (di-pi). D1 includes Person2Description data, denoted as p1-d1, and the reversal Description2Person set, d1-p1. Similarly, D2 is composed of d2-p2 and p2-d2. D3, denoted as d3-p3, includes data in both formats and helps the model to generalize. The model is trained on d1-p1, p2-d2 and D3, and tested on d1-p1, p1-d1, d2-p2 and p2-d2. Examples of the training and test data, as well as statistics, are shown in Table 7. As shown in Table 8, we compare our SPT on four subsets, Description2Person (d1-p1) and the corresponding reversal data (p1-d1), Person2Description (d2-p2) and the reversal data (p2-d2), with the following baselines: 1) BICO (Lv et al., 2023) introduces the bi-directional attention mech-
id: 60bae464ef8020ffa1679d943f818458 - page: 6
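The two sentence templates in the chunk above are enough to show how a forward/reversal pair is formed; a tiny sketch follows. The example person and description strings are illustrative only (they echo the well-known example from Berglund et al., 2023), and the function names are hypothetical.

# Illustrative construction of a forward/reversal pair in the Person
# Description setting; only the two templates come from the passage above.
def person_to_description(person: str, description: str) -> str:
    return f"{person} is {description}"          # p-d format

def description_to_person(person: str, description: str) -> str:
    return f"{description} is {person}"          # d-p format

pair = ("Daphne Barrington", "the director of 'A Journey Through Time'")
forward = description_to_person(*pair)   # d1-p1 style, seen in training
reversal = person_to_description(*pair)  # p1-d1 style, held-out reversal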
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "I8zB-ZyDNunbGb8J-wm0_g1eEf3KWxUkP0H0geKCRxs", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "I8zB-ZyDNunbGb8J-wm0_g1eEf3KWxUkP0H0geKCRxs", "level": 2}'