Realizing visual question answering for education: GPT-4V as a multimodal AI
Type
Article
Citation
Lee, G., & Zhai, X. (2025). Realizing visual question answering for education: GPT-4V as a multimodal AI. TechTrends. Advance online publication. https://doi.org/10.1007/s11528-024-01035-z
Abstract
Educators and researchers have analyzed various image data acquired from teaching and learning, such as images of learning materials, classroom dynamics, and students' drawings. However, this approach is labour-intensive and time-consuming, limiting its scalability and efficiency. Recent developments in Visual Question Answering (VQA) techniques have streamlined this process by allowing users to pose questions about images and receive accurate, automatic answers, both in natural language, thereby enhancing efficiency and reducing the time required for analysis. State-of-the-art Vision Language Models (VLMs) such as GPT-4V(ision) have extended the applications of VQA to a wide range of educational purposes. This report employs GPT-4V as an example to demonstrate the potential of VLMs in enabling and advancing VQA for education. Specifically, we demonstrate that GPT-4V enables VQA for educational scholars without requiring technical expertise, thereby reducing accessibility barriers for general users. In addition, we contend that GPT-4V spotlights the transformative potential of VQA for educational research, representing a milestone accomplishment for visual data analysis in education.
Publisher
Springer
Journal
TechTrends
Grant ID
2101104
R305C240010
Funding Agency
National Science Foundation
Institute of Education Sciences