Can Multimodal LLMs "See" Science Instruction?
Benchmarking Pedagogical Reasoning in K-12 Classroom Videos

* Equal contribution   + Corresponding author
1Drexel University 2Washington State University 3Beijing Normal University 4UNC Chapel Hill 5City University of Hong Kong
Visual Context Impact on CIP Classification

Figure 1: MLLM accuracy varies across CIP categories and improves with visual context. Visual information supplies contextual cues for understanding classroom instructional practices, particularly in categories that require observing physical interactions and demonstrations.

Abstract

While Multimodal Large Language Models (MLLMs) have shown promise in video understanding, their ability to comprehend complex pedagogical practices in classroom settings remains unexplored. We present SciIBI (Science Instructional Benchmarking for Instruction), a benchmark for evaluating MLLMs on Classroom Instructional Practice (CIP) classification aligned with the Next Generation Science Standards (NGSS). SciIBI comprises 113 video clips from authentic K-12 science classrooms, each annotated with one of seven CIP categories representing core instructional strategies. We evaluate eight state-of-the-art MLLMs across several input modalities and prompting strategies. Our findings reveal that: (1) current MLLMs struggle with pedagogical reasoning, with the best model reaching only 53.6% accuracy; (2) visual context yields a +4.8% average improvement over transcript-only inputs; and (3) certain instructional practices, such as "Developing Models", remain particularly challenging. SciIBI provides a foundation for developing AI systems that support science education research and teacher professional development.

Key Findings

  • 113: NGSS-Aligned Video Clips
  • 8: State-of-the-Art MLLMs Evaluated
  • 53.6%: Best Model Accuracy (InternVL3-78B)
  • +4.8%: Avg. Improvement with Visual Context

Benchmark Overview

SciIBI is designed to evaluate MLLMs' understanding of science instructional practices in authentic K-12 classroom settings. The benchmark focuses on seven Classroom Instructional Practice (CIP) categories derived from the NGSS framework:

  • Asking Questions: Teacher or students pose questions to guide inquiry
  • Developing Models: Creating visual or physical representations of concepts
  • Planning Investigations: Designing experiments and data collection methods
  • Analyzing Data: Interpreting results and identifying patterns
  • Mathematical Thinking: Using quantitative reasoning and calculations
  • Constructing Explanations: Building evidence-based scientific explanations
  • Engaging in Argument: Debating and defending scientific claims
CIP Category Distribution

Figure 2: Distribution of CIP categories in the SciIBI benchmark.
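As a minimal sketch of how the seven categories above could be represented and tallied into a distribution like the one in Figure 2, consider the following. The function name `category_distribution` and the toy labels are illustrative assumptions, not part of the SciIBI release, and the counts shown are not the benchmark's actual distribution.

```python
from collections import Counter

# The seven CIP categories, named as in the list above.
CIP_CATEGORIES = (
    "Asking Questions",
    "Developing Models",
    "Planning Investigations",
    "Analyzing Data",
    "Mathematical Thinking",
    "Constructing Explanations",
    "Engaging in Argument",
)


def category_distribution(labels):
    """Return {category: count} for every CIP category, including zeros.

    Hypothetical helper: rejects any label outside the seven categories.
    """
    unknown = set(labels) - set(CIP_CATEGORIES)
    if unknown:
        raise ValueError(f"Unknown CIP labels: {sorted(unknown)}")
    counts = Counter(labels)
    return {category: counts.get(category, 0) for category in CIP_CATEGORIES}


# Toy annotations for illustration only; NOT the benchmark's real labels.
toy_labels = ["Asking Questions", "Analyzing Data", "Asking Questions"]
distribution = category_distribution(toy_labels)
```

Returning a zero for unseen categories keeps the output aligned with the fixed seven-way label space, which makes class imbalance easy to spot.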

Method

Input Modalities

We evaluate MLLMs using three input configurations to understand the contribution of different modalities:

  • Transcript Only (T): Text transcripts of classroom dialogue
  • Video Only (V): Visual frames sampled from video clips
  • Video + Transcript (V+T): Combined multimodal input
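The three configurations above can be sketched as a small input builder. This is an illustration under stated assumptions: the source says only that frames are "sampled" from each clip, so the evenly spaced sampling below, the default of 8 frames, and the names `Clip`, `Modality`, `sample_frame_indices`, and `build_input` are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    T = "transcript_only"
    V = "video_only"
    VT = "video_plus_transcript"


@dataclass
class Clip:
    transcript: str    # text of the classroom dialogue
    total_frames: int  # number of decoded frames in the clip


def sample_frame_indices(total_frames, num_frames):
    """Pick indices spread evenly over the clip (midpoint of each segment).

    Assumption: uniform sampling; the paper's actual strategy is not stated.
    """
    if total_frames <= 0 or num_frames <= 0:
        return []
    num_frames = min(num_frames, total_frames)
    segment = total_frames / num_frames
    return [int(segment * i + segment / 2) for i in range(num_frames)]


def build_input(clip, modality, num_frames=8):
    """Assemble the model input for one of the T, V, or V+T configurations."""
    use_text = modality in (Modality.T, Modality.VT)
    use_video = modality in (Modality.V, Modality.VT)
    return {
        "transcript": clip.transcript if use_text else None,
        "frame_indices": (
            sample_frame_indices(clip.total_frames, num_frames) if use_video else []
        ),
    }


clip = Clip(transcript="Teacher: What do you notice about the ice?", total_frames=100)
inputs_by_modality = {m: build_input(clip, m, num_frames=4) for m in Modality}
```

Holding the clip and frame budget fixed while switching only the modality flag is what lets accuracy differences be attributed to the added visual or textual signal.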

Prompting Strategies

We employ three prompting strategies to assess model capabilities:

  • Zero-shot: Direct classification without examples
  • Zero-shot with CoT: Chain-of-thought reasoning for pedagogical analysis
  • Few-shot: In-context examples of each CIP category
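A rough sketch of how the three strategies above differ at the prompt level follows. The prompt wording, the strategy keys, and the function `build_prompt` are assumptions made for illustration; the authors' actual prompts are not given in this page.

```python
# The seven CIP categories offered to the model as answer options.
CIP_CATEGORIES = (
    "Asking Questions",
    "Developing Models",
    "Planning Investigations",
    "Analyzing Data",
    "Mathematical Thinking",
    "Constructing Explanations",
    "Engaging in Argument",
)


def build_prompt(strategy, transcript, examples=()):
    """Build a classification prompt for one strategy (hypothetical wording).

    strategy: "zero_shot", "zero_shot_cot", or "few_shot".
    examples: (transcript, label) pairs, used only for "few_shot".
    """
    if strategy not in ("zero_shot", "zero_shot_cot", "few_shot"):
        raise ValueError(f"Unknown strategy: {strategy}")

    options = "\n".join(f"- {name}" for name in CIP_CATEGORIES)
    parts = [
        "Classify the classroom clip into exactly one instructional practice:",
        options,
    ]

    if strategy == "few_shot":
        if not examples:
            raise ValueError("few_shot requires at least one (transcript, label) example")
        # In-context demonstrations come before the clip to be classified.
        for example_text, example_label in examples:
            parts.append(f"Transcript: {example_text}\nAnswer: {example_label}")

    parts.append(f"Transcript: {transcript}")

    if strategy == "zero_shot_cot":
        # Chain-of-thought: ask for reasoning before the final label.
        parts.append(
            "Reason step by step about the teaching activity, then give the final answer."
        )

    parts.append("Answer:")
    return "\n\n".join(parts)


zero_shot = build_prompt("zero_shot", "What do you notice about the ice?")
cot = build_prompt("zero_shot_cot", "What do you notice about the ice?")
few_shot = build_prompt(
    "few_shot",
    "Let's graph the temperatures we recorded.",
    examples=[("Why does ice float?", "Asking Questions")],
)
```

Keeping the task statement and option list identical across strategies isolates the effect of the added reasoning instruction or demonstrations.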
Evaluation Framework

Figure 3: Our evaluation framework for assessing MLLMs on pedagogical reasoning.

Results

Model | Transcript (T) | Video+Transcript (V+T)

Table 1: Classification accuracy across different models and input modalities using zero-shot prompting.

Key Observation: Every model improves when visual and textual information are combined (V+T), with an average gain of +4.8% over the transcript-only baseline. This highlights the importance of visual context for understanding classroom instructional practices.
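The arithmetic behind an average gain of this kind can be sketched as follows. This assumes the gain is the mean per-model difference between V+T and T accuracy, measured in percentage points, which the page does not state explicitly. The function names and the toy scores are illustrative and are not the paper's results.

```python
def accuracy(predictions, gold_labels):
    """Percentage of clips whose predicted CIP category matches the annotation."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    if not gold_labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return 100.0 * correct / len(gold_labels)


def average_visual_gain(scores):
    """Mean of (V+T accuracy - T accuracy) across models, in percentage points.

    scores maps a model name to a (transcript_only, video_plus_transcript) pair.
    """
    if not scores:
        raise ValueError("scores must contain at least one model")
    gains = [vt - t for t, vt in scores.values()]
    return sum(gains) / len(gains)


# Toy numbers for illustration only; NOT taken from Table 1.
toy_scores = {"model_a": (40.0, 46.0), "model_b": (50.0, 52.0)}
toy_gain = average_visual_gain(toy_scores)  # gains of 6.0 and 2.0 average to 4.0
```

Averaging per-model differences, rather than differencing two pooled accuracies, weights each model equally regardless of how its absolute accuracy compares with the others.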

BibTeX

@article{shen2026sciibi,
  title={Can Multimodal LLMs "See" Science Instruction? Benchmarking Pedagogical Reasoning in K-12 Classroom Videos},
  author={Shen, Yixuan and He, Peng and Liu, Honglu and Ji, Yuyang and Li, Tingting and Chen, Tianlong and Xu, Kaidi and Liu, Feng},
  journal={arXiv preprint},
  year={2026}
}