Bo Li 1,4, Qinghua Zhao 2,3, Lijie Wen 1
1 Tsinghua University, Beijing, China
2 Beihang University, Beijing, China
3 University of Copenhagen, Copenhagen, Denmark
4 Baidu Inc., Beijing, China
li-b19@mails.tsinghua.edu.cn, zhaoqh@buaa.edu.cn, wenlj@tsinghua.edu.cn

Abstract

Probing the memorization of large language models holds significant importance. Previous works have established metrics for quantifying memorization, explored various influencing factors such as data duplication, model size, and prompt length, and evaluated memorization by comparing model outputs with training corpora. However, training corpora are of enormous scale, and pre-processing them is time-consuming. To explore memorization without accessing training data, we propose a novel approach, named ROME 1, in which memorization is explored by comparing disparities between memorized and non-memorized samples. Specifically, the selected samples are first categorized into memorized and non-memorized groups, and the two groups are then compared from the perspectives of text, probability, and hidden state. Experimental findings show disparities in factors including word length, part-of-speech, word frequency, and mean and variance, to name a few.
[Figure 4: Sample counts comparison between memorized and non-memorized instances on the IDIOMIM dataset, regarding word number of idiom, length of the last word, and part-of-speech (POS). Panels: (a) word number of idiom, (b) length of the last word, (c) part-of-speech (POS).]
Generation. For the evaluated models, we employ the generation code provided by HuggingFace. The HuggingFace generation code combines a variety of decoding strategies 4, such as greedy decoding (Germann, 2003), beam search (Freitag and Al-Onaizan, 2017), top-K sampling (Fan et al., 2018), top-p sampling (Holtzman et al., 2020), and contrastive search (Su et al., 2022), but it does not provide access to probabilities and hidden states. To acquire probabilities and hidden states, we manually implemented the greedy decoding strategy in the model's generation process. Greedy decoding is a classic and straightforward decoding method that selects the token with the highest probability as the next token at each time step.
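For illustration, here is a minimal sketch of such a manual greedy loop that records per-step probabilities and hidden states via the HuggingFace transformers API. The checkpoint name and the helper function are assumptions for the sketch, not the paper's actual code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumed checkpoint; the paper uses LLaMA 2 7B (see Subsection 4.1).
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

def greedy_generate(prompt, max_new_tokens=32):
    """Greedy decoding that also returns next-token probabilities
    and last-layer hidden states at each step (illustrative helper)."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    probs_per_step, hidden_per_step = [], []
    for _ in range(max_new_tokens):
        with torch.no_grad():
            out = model(input_ids, output_hidden_states=True)
        # Probability distribution over the vocabulary for the next token.
        next_probs = torch.softmax(out.logits[:, -1, :], dim=-1)
        # Greedy decoding: pick the single most probable token.
        next_id = next_probs.argmax(dim=-1, keepdim=True)
        probs_per_step.append(next_probs[0, next_id].item())
        # Hidden state of the last layer at the final position.
        hidden_per_step.append(out.hidden_states[-1][:, -1, :])
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    text = tokenizer.decode(input_ids[0], skip_special_tokens=True)
    return text, probs_per_step, hidden_per_step
```

4 Experimental Analysis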
4.1 Parameter Settings. Our experiments were conducted with the LLaMA 2 model, trained on an extensive 2-trillion-token corpus that includes diverse datasets such as English CommonCrawl (Wenzek et al., 2020), C4 (Raffel et al., 2020), GitHub, Wikipedia, Gutenberg and Books3 (Gao et al., 2020), ArXiv, and StackExchange, among others (source: https://ai.meta.com/llama). For our inference tests, we utilized the 7B version, operating on a single 32GB V100 GPU with the model loaded in float32 precision. We also tested LLaMA 2 13B; given its consistent tendency with the 7B model, we omit those results. Key specifications of the model include a vocabulary size of 32,000, a hidden size of 4096, an intermediate size of 11008, 32 layers, and 32 attention heads. The maximum position embedding was set at 2048. The model employs the SiLU activation function, and its weight matrices are initialized with a standard deviation of 0.02.
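These hyperparameters can also be read directly from the model configuration rather than hard-coded; a minimal sketch (checkpoint name assumed) using HuggingFace's AutoConfig:

```python
from transformers import AutoConfig

# Read the architecture hyperparameters listed above from the
# model configuration (checkpoint name is an assumption).
config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")

for field in (
    "vocab_size",               # vocabulary size
    "hidden_size",              # hidden size
    "intermediate_size",        # intermediate (MLP) size
    "num_hidden_layers",        # number of layers
    "num_attention_heads",      # number of attention heads
    "max_position_embeddings",  # maximum position embedding
    "hidden_act",               # activation function
    "initializer_range",        # init standard deviation
):
    print(field, getattr(config, field))
```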
4.2 Text-oriented Analysis. To investigate whether there are distinct disparities between memorized and non-memorized samples in terms of the word number of the idiom, the length of the last word, and POS, we extracted these features and counted the number and proportion of memorized and non-memorized samples under each feature, as shown in Figure 4 and sketched in the code below. Longer prompts are more memorized. Figure 4a demonstrates the relationship between the word number of an idiom and its likelihood of being memorized or non-memorized. It shows that the proportion of non-memorized idioms initially drops and stays at zero for idioms longer than 9 words. That is, the longer the idiom, the higher its probability of being memorized. This pattern suggests that prompt length influences the model's generation, and that longer prompts are more likely to be memorized or tend to help the model recall more memorized knowledge.
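For concreteness, a minimal sketch of this feature tally follows. The helper names and the use of the NLTK POS tagger are assumptions; the paper does not specify its tooling:

```python
from collections import Counter
import nltk  # assumed tagger; requires nltk.download("averaged_perceptron_tagger")

def surface_features(idiom):
    """Extract the three surface features compared in Figure 4."""
    words = idiom.split()
    n_words = len(words)                       # word number of idiom
    last_len = len(words[-1])                  # length of the last word
    pos_of_last = nltk.pos_tag(words)[-1][1]   # POS tag of the last word
    return n_words, last_len, pos_of_last

def tally(idioms):
    """Count how many idioms in a group fall under each feature value."""
    word_counts, last_lens, pos_tags = Counter(), Counter(), Counter()
    for idiom in idioms:
        n_words, last_len, pos = surface_features(idiom)
        word_counts[n_words] += 1
        last_lens[last_len] += 1
        pos_tags[pos] += 1
    return word_counts, last_lens, pos_tags

# memorized / non_memorized are the two groups from Subsection 3.1, e.g.:
# mem_stats = tally(memorized); non_mem_stats = tally(non_memorized)
```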