Automatic item generation in various STEM subjects using large language model prompting
Type
Article
Citation
Chan, K. W., Farhan Ali, Park, J., Sham, K. S. B., Tan, E. Y. T., Chong, F. W. C., Qian, K., & Sze, G. K. (2025). Automatic item generation in various STEM subjects using large language model prompting. Computers and Education: Artificial Intelligence, 8, Article 100344. https://doi.org/10.1016/j.caeai.2024.100344
Author
Chan, Kuang Wen
•
Farhan Ali
•
Park, J.
•
Sham, Brandon Kah Shen
•
Tan, Erdalyn Yeh Thong
•
Chong, Francis Woon Chien
•
Qian, Kun
•
Sze, Guan Kheng
Abstract
Large language models (LLMs) that power chatbots such as ChatGPT have capabilities across numerous domains. Teachers and students have been increasingly using chatbots in science, technology, engineering, and mathematics (STEM) subjects in various ways, including for assessment purposes. However, there has been a lack of systematic investigation into LLMs' capabilities and limitations in automatically generating items for STEM subject assessments, especially given that LLMs can hallucinate and may risk promoting misconceptions and hindering conceptual understanding. To address this, we systematically investigated LLMs' conceptual understanding and quality of working in generating question and answer pairs across various STEM subjects. We applied prompt engineering to GPT-3.5 and GPT-4 with three approaches: standard prompting, standard prompting augmented with chain-of-thought prompting using worked examples with steps, and chain-of-thought prompting with coding language. The question and answer pairs were generated at the pre-university level in three STEM subjects (chemistry, physics, and mathematics) and evaluated by subject-matter experts. Overall, we found that the LLMs generated quality questions under chain-of-thought prompting for both GPT-3.5 and GPT-4, and under chain-of-thought prompting with coding language for GPT-4. However, there were varying patterns in generating multistep answers, with differences in the accuracy of final answers and intermediate steps. An interesting finding was that chain-of-thought prompting with coding language for GPT-4 significantly outperformed the other approaches in generating correct final answers, while demonstrating moderate accuracy in generating multistep answers correctly. In addition, through qualitative analysis, we identified domain-specific prompting patterns across the three STEM subjects. We then discussed how our findings aligned with, contradicted, and contributed to the current body of knowledge on automatic item generation research using LLMs, and the implications for teachers using LLMs to generate STEM assessment items.
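For readers who want a concrete picture of the approach described above, the sketch below shows one way a chain-of-thought prompt with a worked example could be sent to GPT-4 through the OpenAI Python API to generate a question and answer pair. The prompt wording, the worked example, the subject, and the model parameters are illustrative assumptions and do not reproduce the authors' actual prompts or materials.

```python
# Illustrative sketch only: prompt text, worked example, and parameters are
# assumptions for demonstration, not the study's actual materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Chain-of-thought style prompt: a worked example with explicit steps,
# followed by a request for a new question and a stepwise answer.
cot_prompt = """You are a pre-university physics teacher writing assessment items.

Worked example:
Question: A 2 kg block is pushed with a constant 10 N force across a
frictionless surface. What is its acceleration?
Step 1: Identify the relevant relation, Newton's second law F = ma.
Step 2: Rearrange for acceleration, a = F / m.
Step 3: Substitute values, a = 10 N / 2 kg = 5 m/s^2.
Final answer: 5 m/s^2

Now generate one new multistep question on kinematics and answer it using
the same numbered-step format, ending with "Final answer:"."""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": cot_prompt}],
    temperature=0.7,
)

print(response.choices[0].message.content)  # generated question and stepwise answer
```

Under the third approach studied, chain-of-thought prompting with coding language, the prompt would additionally ask the model to express its working as executable code rather than prose steps; the exact instructions used for that variant are not shown here.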
Date Issued
2025
Publisher
Elsevier
Journal
Computers and Education: Artificial Intelligence
DOI
10.1016/j.caeai.2024.100344
Description
The open access publication is available at https://doi.org/10.1016/j.caeai.2024.100344