Large language models (LLMs) are currently at the forefront of intertwining AI systems with human communication and everyday life. This paper introduces a new field of research called machine psychology. The paper outlines how different subfields of psychology can inform behavioral tests for LLMs. It defines methodological standards for machine psychology research, especially by focusing on policies for prompt designs. Additionally, it describes how behavioral patterns discovered in LLMs are to be interpreted.
It is likely that the field of machine psychology will sooner or later encounter a replication crisis similar to that experienced in traditional psychology (Open Science Collaboration 2015; Haibe-Kains et al. 2020; Kapoor et al. 2023). In short, the
strong downstream effects that prompt designs have on prompt completions make a reasonable set of methods crucial when conducting machine psychology studies. These methods are listed in Table 2, providing a structured approach and guardrails for conducting sound empirical machine psychology studies with LLMs.
Avoid training data contamination

Many of the studies already conducted in the field of machine psychology share a significant shortcoming: they take prompts from psychology studies and apply them to LLMs without changing their wording, task order, etc. As a result, LLMs are likely to have already encountered identical or very similar tasks during training, causing them to merely reproduce known token patterns (Emami et al. 2020). When adopting test frameworks from psychology (vignettes, questionnaires, or other test setups), one must ensure that the LLM has never seen the tests before, so that solving them requires going beyond mere memorization. Prompts may therefore be structurally similar to existing tasks, but they should contain new wordings, agents, orders, actions, etc. If the prompts refer to real-world scenarios, however, the tasks should not reference events that occurred after the LLM's training cutoff.

Multiply prompts
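One minimal way to operationalize this guardrail is to keep the structure of a classic task while freshly sampling its surface content. The sketch below is purely illustrative (the vignette template, agent names, and objects are hypothetical, not taken from any published test battery): it renders a false-belief-style scenario whose exact wording is newly generated and thus unlikely to appear verbatim in training data.

```python
import random

# Hypothetical template: the task structure mirrors a classic false-belief
# vignette, but agents, objects, and locations are freshly sampled so the
# exact wording is novel.
TEMPLATE = (
    "{agent_a} puts the {obj} in the {loc_a} and leaves the room. "
    "While {agent_a} is away, {agent_b} moves the {obj} to the {loc_b}. "
    "Where will {agent_a} look for the {obj}?"
)

AGENTS = ["Ravi", "Noor", "Tomas", "Yuki"]
OBJECTS = ["ledger", "compass", "stapler"]
LOCATIONS = ["drawer", "basket", "cabinet", "satchel"]

def novel_vignette(rng: random.Random) -> str:
    """Render one structurally familiar but newly worded vignette."""
    agent_a, agent_b = rng.sample(AGENTS, 2)   # two distinct agents
    loc_a, loc_b = rng.sample(LOCATIONS, 2)    # two distinct locations
    return TEMPLATE.format(
        agent_a=agent_a,
        agent_b=agent_b,
        obj=rng.choice(OBJECTS),
        loc_a=loc_a,
        loc_b=loc_b,
    )

print(novel_vignette(random.Random(0)))
```

In practice the sampled names, objects, and actions would also need a manual check against the model's training cutoff, as noted above.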
Machine psychology studies on LLMs represent highly controlled experimental settings without the confounding factors that affect human studies. However, a common shortcoming of several existing machine psychology studies is that they rely on small sample sizes or convenience samples, meaning nonsystematic sequences of prompts. Sampling biases, which are especially prevalent in small samples, can diminish the quality of machine psychology studies, because slight changes in prompts can already change model outputs significantly. Given this high sensitivity to prompt wording, it is important to test multiple versions of one task and to create representative samples, meaning batteries of slightly varied prompts. Only this way can one reliably measure whether a certain behavior reoccurs systematically and generalizes. Multiplying prompts can be done automatically or manually using hypothesis-blind research assistants. Moreover, researchers must of course ensure
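The automatic route can be sketched as a simple cross-product over surface variations. The code below is a hedged illustration (the instruction phrasings, question template, and agent names are invented for this example): it expands one underlying task into a battery of variants differing in instruction wording, answer-option order, and agent name, which could then be submitted to a model to test whether a behavior recurs across wordings.

```python
import itertools

# Hypothetical surface variations for one underlying task.
INSTRUCTIONS = [
    "Answer the question.",
    "Please respond to the question below.",
]
OPTION_ORDERS = [("yes", "no"), ("no", "yes")]
QUESTION = "Would {agent} share the prize with a stranger?"
AGENTS = ["Mira", "Oleg", "Dana"]

def prompt_battery() -> list[str]:
    """Expand one task into a battery of slightly varied prompts."""
    battery = []
    for instr, (opt1, opt2), agent in itertools.product(
        INSTRUCTIONS, OPTION_ORDERS, AGENTS
    ):
        battery.append(
            f"{instr}\n{QUESTION.format(agent=agent)}\nOptions: {opt1} / {opt2}"
        )
    return battery

battery = prompt_battery()
print(len(battery))  # 2 instructions x 2 orders x 3 agents = 12 variants
```

A behavior would then count as systematic only if it appears consistently across the whole battery, not in a single hand-picked wording.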