Can Multimodal LLMs "See" Science Instruction?
Benchmarking Pedagogical Reasoning in K-12 Classroom Videos

Yixuan Shen¹*, Peng He²*, Honglu Liu²,³, Yuyang Ji¹, Tingting Li², Tianlong Chen⁴, Kaidi Xu⁵, Feng Liu¹⁺

* Equal contribution ⁺ Corresponding author

¹Drexel University ²Washington State University ³Beijing Normal University ⁴UNC Chapel Hill ⁵City University of Hong Kong


Visual Context Impact on CIP Classification

Figure 1: MLLM accuracy varies across CIP categories and improves with visual context. Visual information supplies contextual cues for understanding classroom instructional practices, particularly in categories that require observing physical interactions and demonstrations.

Abstract

While Multimodal Large Language Models (MLLMs) have shown promise in video understanding, their ability to comprehend complex pedagogical practices in classroom settings remains unexplored. We present SciIBI (Science Instructional Benchmarking for Instruction), a benchmark for evaluating MLLMs on Classroom Instructional Practice (CIP) classification aligned with the Next Generation Science Standards (NGSS). SciIBI comprises 113 video clips from authentic K-12 science classrooms, each annotated with one of seven CIP categories representing core instructional strategies. We evaluate eight state-of-the-art MLLMs across three input configurations (transcript only, video only, and video plus transcript) and three prompting strategies (zero-shot, zero-shot with chain-of-thought, and few-shot). Our findings reveal that: (1) current MLLMs struggle with pedagogical reasoning, with the best model achieving only 53.6% accuracy; (2) visual context provides a +4.8% average improvement over transcript-only inputs; and (3) certain instructional practices, such as "Developing Models", remain particularly challenging. SciIBI provides a foundation for developing AI systems capable of supporting science education research and teacher professional development.

Key Findings

NGSS-Aligned Video Clips: 113
State-of-the-Art MLLMs Evaluated: 8
Best Model Accuracy (InternVL3-78B): 53.6%
Avg. Improvement with Visual Context: +4.8%

Benchmark Overview

SciIBI is designed to evaluate MLLMs' understanding of science instructional practices in authentic K-12 classroom settings. The benchmark focuses on seven Classroom Instructional Practice (CIP) categories derived from the NGSS framework:

Asking Questions: Teacher or students pose questions to guide inquiry
Developing Models: Creating visual or physical representations of concepts
Planning Investigations: Designing experiments and data collection methods
Analyzing Data: Interpreting results and identifying patterns
Mathematical Thinking: Using quantitative reasoning and calculations
Constructing Explanations: Building evidence-based scientific explanations
Engaging in Argument: Debating and defending scientific claims

Figure 2: Distribution of CIP categories in the SciIBI benchmark.
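As a rough illustration of how the seven categories above could be handled in an evaluation script, the sketch below defines the label set and two helpers. The function names `parse_cip_label` and `label_distribution` are hypothetical, and the policy of rejecting responses that name zero or several categories is an assumption, not something the benchmark specifies.

```python
from collections import Counter
from typing import Optional

# The seven CIP categories as named in the benchmark overview.
CIP_CATEGORIES = [
    "Asking Questions",
    "Developing Models",
    "Planning Investigations",
    "Analyzing Data",
    "Mathematical Thinking",
    "Constructing Explanations",
    "Engaging in Argument",
]


def parse_cip_label(response: str) -> Optional[str]:
    """Return the single CIP category named in a model response, or None.

    Matching is case-insensitive. A response that names no category, or
    more than one, yields None (an assumed policy for this sketch).
    """
    text = response.lower()
    hits = [c for c in CIP_CATEGORIES if c.lower() in text]
    return hits[0] if len(hits) == 1 else None


def label_distribution(labels: list[str]) -> dict[str, int]:
    """Count clips per category, keeping categories that have zero clips."""
    counts = Counter(labels)
    return {c: counts.get(c, 0) for c in CIP_CATEGORIES}
```

Returning `None` for ambiguous responses keeps unparseable outputs visible as errors instead of silently assigning them to a category.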

Method

Input Modalities

We evaluate MLLMs using three input configurations to understand the contribution of different modalities:

Transcript Only (T): Text transcripts of classroom dialogue
Video Only (V): Visual frames sampled from video clips
Video + Transcript (V+T): Combined multimodal input
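The three configurations above can be sketched as a small input-selection step. This is a minimal illustration under stated assumptions: the `Clip` container, the `build_inputs` and `sample_uniform` helpers, uniform frame sampling, and the default of 8 frames are all hypothetical choices, since the page does not say how frames are sampled or how many are used.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Clip:
    """Hypothetical container for one benchmark clip."""
    transcript: str
    frame_paths: List[str] = field(default_factory=list)


def sample_uniform(items: List[str], k: int) -> List[str]:
    """Pick up to k items spread evenly across the list (assumed sampling scheme)."""
    if k <= 0 or not items:
        return []
    if len(items) <= k:
        return list(items)
    step = len(items) / k
    return [items[int(i * step)] for i in range(k)]


def build_inputs(clip: Clip, modality: str, num_frames: int = 8) -> dict:
    """Select the parts of a clip that a given configuration exposes.

    modality is "T" (transcript only), "V" (video only), or "V+T" (both).
    """
    if modality not in ("T", "V", "V+T"):
        raise ValueError(f"unknown modality: {modality}")
    inputs = {}
    if modality in ("T", "V+T"):
        inputs["transcript"] = clip.transcript
    if modality in ("V", "V+T"):
        inputs["frames"] = sample_uniform(clip.frame_paths, num_frames)
    return inputs
```

Keeping the modality switch in one function means the same clip and the same prompt can be reused across configurations, so any accuracy difference is attributable to the inputs alone.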

Prompting Strategies

We assess each model with three prompting strategies:

Zero-shot: Direct classification without examples
Zero-shot with CoT: Chain-of-thought reasoning for pedagogical analysis
Few-shot: In-context examples of each CIP category

Figure 3: Our evaluation framework for assessing MLLMs on pedagogical reasoning.
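To make the three prompting strategies concrete, here is a minimal sketch of a prompt builder for the transcript-only case. The prompt wording, the `build_prompt` function, and the strategy identifiers `"zero-shot"`, `"zero-shot-cot"`, and `"few-shot"` are illustrative assumptions; the actual prompts used in the benchmark are not given on this page.

```python
from typing import Sequence, Tuple

CIP_CATEGORIES = [
    "Asking Questions",
    "Developing Models",
    "Planning Investigations",
    "Analyzing Data",
    "Mathematical Thinking",
    "Constructing Explanations",
    "Engaging in Argument",
]


def build_prompt(
    transcript: str,
    strategy: str = "zero-shot",
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a classification prompt (hypothetical wording).

    examples holds (transcript, label) pairs and is used only for few-shot.
    """
    if strategy not in ("zero-shot", "zero-shot-cot", "few-shot"):
        raise ValueError(f"unknown strategy: {strategy}")
    if strategy == "few-shot" and not examples:
        raise ValueError("few-shot prompting needs at least one example")

    options = "\n".join(f"- {c}" for c in CIP_CATEGORIES)
    parts = [
        "Classify the classroom clip into exactly one of these "
        "instructional practices:\n" + options
    ]
    if strategy == "few-shot":
        # In-context examples precede the clip to be classified.
        for ex_transcript, ex_label in examples:
            parts.append(f"Transcript: {ex_transcript}\nAnswer: {ex_label}")
    parts.append(f"Transcript: {transcript}")
    if strategy == "zero-shot-cot":
        # Ask for reasoning first, then a final answer line.
        parts.append(
            "Think step by step about what the teacher and students are "
            "doing, then give the category on a final line starting with "
            "'Answer:'."
        )
    else:
        parts.append("Answer:")
    return "\n\n".join(parts)
```

For example, `build_prompt("T: Why does ice float?", "zero-shot-cot")` yields the category list, the transcript, and the step-by-step instruction, while the other two strategies end with a bare `Answer:` cue.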

Results

Model	Transcript (T)	Video+Transcript (V+T)

Table 1: Classification accuracy across different models and input modalities using zero-shot prompting.

Key Observation: All models show consistent improvement when combining visual and textual information (V+T), with an average gain of +4.8% over transcript-only baselines. This highlights the importance of visual context in understanding classroom instructional practices.
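The average gain quoted above can be computed along the following lines. This sketch assumes the gain is an absolute difference in accuracy (percentage points) averaged over models, which the page does not state explicitly; the function names are hypothetical and no benchmark numbers are reproduced here.

```python
from typing import Dict, Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of clips whose predicted label matches the annotation."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


def average_gain(acc_t: Dict[str, float], acc_vt: Dict[str, float]) -> float:
    """Mean (V+T accuracy minus T accuracy) over models present in both dicts.

    Inputs map a model name to its accuracy in percent; the result is in
    percentage points (assumed interpretation of the reported gain).
    """
    models = sorted(set(acc_t) & set(acc_vt))
    if not models:
        raise ValueError("no models in common")
    return sum(acc_vt[m] - acc_t[m] for m in models) / len(models)
```

With per-model accuracies in hand, `average_gain` takes the transcript-only results as its first argument and the video-plus-transcript results as its second; a positive value means visual context helped on average.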

BibTeX

@article{shen2026sciibi,
  title={Can Multimodal LLMs "See" Science Instruction? Benchmarking Pedagogical Reasoning in K-12 Classroom Videos},
  author={Shen, Yixuan and He, Peng and Liu, Honglu and Ji, Yuyang and Li, Tingting and Chen, Tianlong and Xu, Kaidi and Liu, Feng},
  journal={arXiv preprint},
  year={2026}
}