While Multimodal Large Language Models (MLLMs) have shown promise in video understanding, their ability to comprehend complex pedagogical practices in classroom settings remains unexplored. We present SciIBI (Science Instructional Benchmarking for Instruction), a benchmark for evaluating MLLMs on Classroom Instructional Practice (CIP) classification aligned with the Next Generation Science Standards (NGSS). SciIBI comprises 113 video clips from authentic K-12 science classrooms, each annotated with one of seven CIP categories representing core instructional strategies. We evaluate eight state-of-the-art MLLMs using various input modalities and prompting strategies. Our findings reveal that: (1) current MLLMs struggle with pedagogical reasoning, with the best model achieving only 53.6% accuracy; (2) visual context provides a +4.8% average improvement over transcript-only inputs; and (3) certain instructional practices like "Developing Models" remain particularly challenging. SciIBI provides a foundation for developing AI systems capable of supporting science education research and teacher professional development.
SciIBI is designed to evaluate MLLMs' understanding of science instructional practices in authentic K-12 classroom settings. The benchmark focuses on seven Classroom Instructional Practice (CIP) categories derived from the NGSS framework:
Figure 2: Distribution of CIP categories in the SciIBI benchmark.
We evaluate MLLMs using three input configurations to understand the contribution of different modalities:
We employ multiple prompting approaches to comprehensively assess model capabilities:
Figure 3: Our evaluation framework for assessing MLLMs on pedagogical reasoning.
| Model | Transcript (T) | Video+Transcript (V+T) |
|---|---|---|
Table 1: Classification accuracy across different models and input modalities using zero-shot prompting.
Key Observation: All models show consistent improvement when combining visual and textual information (V+T), with an average gain of +4.8% over transcript-only baselines. This highlights the importance of visual context in understanding classroom instructional practices.
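The accuracy and average-gain figures above could be computed along the following lines. This is a minimal sketch, not the authors' evaluation code: it assumes the reported gain is the per-model difference in accuracy (V+T minus T, in percentage points) averaged over models, which the text does not state explicitly. The function names, model names, and category labels are hypothetical placeholders, and the toy predictions are illustrative only, not benchmark results.

```python
from statistics import mean


def accuracy(predictions, labels):
    """Return classification accuracy as a percentage (0-100)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one labelled clip is required")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def average_gain(acc_t, acc_vt):
    """Mean per-model difference (V+T minus T), in percentage points.

    Assumption: the gain is averaged over models evaluated in both settings.
    """
    shared = acc_t.keys() & acc_vt.keys()
    if not shared:
        raise ValueError("no model was evaluated in both settings")
    return mean(acc_vt[m] - acc_t[m] for m in shared)


# Hypothetical toy data: four clips, placeholder category labels.
labels = ["A", "B", "A", "C"]
preds_t = {  # transcript-only predictions
    "model_x": ["A", "B", "C", "C"],
    "model_y": ["B", "B", "A", "A"],
}
preds_vt = {  # video + transcript predictions
    "model_x": ["A", "B", "A", "C"],
    "model_y": ["A", "B", "A", "A"],
}

acc_t = {m: accuracy(p, labels) for m, p in preds_t.items()}
acc_vt = {m: accuracy(p, labels) for m, p in preds_vt.items()}
gain = average_gain(acc_t, acc_vt)

for m in sorted(acc_t):
    print(f"{m}: T={acc_t[m]:.1f}%  V+T={acc_vt[m]:.1f}%")
print(f"average gain: {gain:+.1f} points")
```

Reporting the gain in percentage points rather than as a relative change keeps it directly comparable with the per-model accuracies in Table 1.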
@article{shen2026sciibi,
title={Can Multimodal LLMs "See" Science Instruction? Benchmarking Pedagogical Reasoning in K-12 Classroom Videos},
author={Shen, Yixuan and He, Peng and Liu, Honglu and Ji, Yuyang and Li, Tingting and Chen, Tianlong and Xu, Kaidi and Liu, Feng},
journal={arXiv preprint},
year={2026}
}