Over a decade ago, automatic item generation (AIG) was introduced to meet the increasing need for high-quality items in educational measurement. Around the same time, a new line of research in computer science emerged that focused on generating questions for educational use. Historically, researchers from these two domains had little knowledge of, or communication with, one another. However, the development of pre-trained large language models (LLMs) has sparked interest among researchers from both domains in applying these models to automatically create items. With similar objectives and methodologies, the two research domains appear to be converging on how to address the problems in this field. The purpose of this study is to review the current state of research by synthesizing existing studies on the use of LLMs for AIG. By combining research from both domains, the authors examine the utility and potential of LLMs for AIG.

Tan, B., Armoush, N., Mazzullo, E., Bulut, O., & Gierl, M. (2024). A Review of Automatic Item Generation Techniques Leveraging Large Language Models. https://doi.org/10.35542/osf.io/6d8tj
(e.g., delivering feedback or evaluating achievement). We further extracted sentences containing the keywords. However, we found that many occurrences of these keywords were unrelated to the measurement properties of the generated items. For instance, most descriptions of reliability appeared in contexts other than the reliability of items or assessment results; they commonly referred to terms like inter-annotator reliability or the reliability of data collection. Similarly, instances of fairness related exclusively to the fairness of experiments comparing model performance, rather than to the fairness of the assessment items themselves. Therefore, the lack of sufficient consideration for the measurement properties of items is even more severe than Figure 3 suggests.
Summary of Findings for RQ3

We found that LLMs can be an effective and flexible solution for generating a large number of items, with few constraints on item type, language, subject domain, or the data source used for training LLMs to create items. However, we did not find many studies reporting the measurement properties of the generated items.

Figure 3. Number of Studies Containing Each Keyword

Discussion
Technological advancements such as e-learning platforms and computer-based assessments have ushered in unprecedented learning opportunities for students, transforming traditional educational practices and assessments. This transformation creates a substantial demand for high-quality assessment items, which are vital for supporting student learning and effectively evaluating educational outcomes (Bulut & Yildirim-Erbasli, 2022; Mazzullo et al., 2023). Therefore, AIG has been proposed and gradually developed by measurement researchers as a solution to reduce the cost of developing large numbers of assessment items (Alves et al., 2010; Gierl & Lai, 2015; Gierl et al., 2021; Lai et al., 2009). Concurrently, recent advancements in LLMs have allowed applied NLP researchers to create items for educational purposes (Akyon et al., 2022; Pochiraju et al., 2023). As research in both domains expands, so does their intersection.
To provide a more comprehensive view of research on the use of LLMs in AIG, this review maps the current literature, bringing together experiences from both research domains to explore the utility and potential of LLMs for AIG.