Can Multimodal LLMs "See" Science Instruction?
Benchmarking Pedagogical Reasoning in K-12 Classroom Videos

* Equal contribution   + Corresponding author
1 Drexel University; 2 Washington State University; 3 Beijing Normal University; 4 UNC Chapel Hill; 5 City University of Hong Kong

Figure 1: The impact of visual context on instructional practice coding. Left: Text-only models fail to detect student engagement (e.g., raising hands) from the transcript alone, misclassifying the clip as a lecture (Big Idea). Right: Multimodal models (Vision+Text) use visual cues to correctly identify "Eliciting Student Ideas" (D1), yielding an average accuracy improvement of 4.8pp across evaluated MLLMs.

Abstract

K–12 science classrooms are rich sites of inquiry where students coordinate phenomena, evidence, and explanatory models through discourse; yet, the multimodal complexity of these interactions has made automated analysis elusive. Existing benchmarks for classroom discourse focus primarily on mathematics and rely solely on transcripts, overlooking the visual artifacts and model-based reasoning emphasized by the Next Generation Science Standards (NGSS). We address this gap with SciIBI (Science Inquiry-Based Instruction), the first video benchmark for analyzing science classroom discourse, featuring 113 NGSS-aligned clips annotated with Core Instructional Practices (CIP) and sophistication levels. By evaluating eight state-of-the-art LLMs and Multimodal LLMs, we reveal fundamental limitations: current models struggle to distinguish pedagogically similar practices, suggesting that CIP coding requires instructional reasoning beyond surface pattern matching. Furthermore, adding video input yields inconsistent gains across architectures. Crucially, our evidence-based evaluation reveals that models often succeed through surface shortcuts rather than genuine pedagogical understanding. These findings establish science classroom discourse as a challenging frontier for multimodal AI and point toward human-AI collaboration, where models retrieve evidence to accelerate expert review rather than replace it.

Key Findings

  • 113 NGSS-aligned video clips
  • 4 Core Instructional Practices (BI, D1, D2, D3)
  • 8 LLMs / MLLMs evaluated
  • 53.6% best accuracy (InternVL3-78B, Text+Vision, zero-shot)

Benchmark Overview

Core Instructional Practices (CIP)

SciIBI operationalizes science classroom discourse analysis using the Core Instructional Practices framework from Windschitl et al. (2012). The framework specifies four practices that support inquiry-based instruction, each with a performance progression from basic to ambitious enactments. Given a classroom clip (with transcript and optional multimodal inputs), a model predicts the dominant practice:

  • BI (Big Idea): selecting and framing big ideas as models that link unobservable processes to phenomena.
  • D1: eliciting students' initial ideas to adapt instruction.
  • D2: guiding students to use theories and models to make sense of observations from the inquiry activity.
  • D3: pressing students for evidence-based explanations that coordinate claims with evidence.

CIP Framework (Windschitl et al., 2012)

Sophistication increases from Level 1 (surface-level) to Level 4 (model-based inquiry). The binary sophistication probe groups Levels 1–2 as Low and Levels 3–4 as High.

Low (Levels 1–2) to High (Levels 3–4); descriptors for each practice are listed in order of increasing sophistication.

BI
  • Topics, vocabulary, "things." Students name, label, identify using correct vocabulary.
  • Observable process. Focus on "what is changing" or how conditions affect an event.
  • Explanatory model focus. Focus on unobservable processes/entities and relationships among concepts. Link to observable phenomena to develop explanatory models.

D1
  • Monitor and reteach. Check for "correct" conceptions; one-on-one tutoring or IRE pattern.
  • Elicit initial understandings. Draw out students' hypotheses and questions about scientific ideas.
  • Adapt to student ideas. Pose open-ended tasks or puzzling events. Use students' language to shape conversations.

D2
  • Focus on procedure. Describe procedures and experimental setups; downplay concepts.
  • Discover/confirm ideas. "Proof of concept" activities; acquire accepted facts and laws.
  • Link concepts across investigations. Seed new concepts; students derive explanatory language.
  • Model-based inquiry. Use evolving models as reference before, during, and after inquiry.

D3
  • No press for explanation. No explanation required; "explain" means "justify."
  • "What happened." Describe variables, group differences, trends, or observations.
  • "How/partial why." Hypothesize and predict system behavior.
  • Causal explanation. Use unobservables to construct causal stories; discuss "what counts" as evidence.

Table 1: CIP framework (Windschitl et al., 2012). Sophistication increases from Level 1 to Level 4.

Tasks

  • Primary task — 4-way CIP classification: predict the dominant practice from {BI, D1, D2, D3}.
  • Sophistication probe: a binary Low (Levels 1–2) vs High (Levels 3–4) label derived from the Windschitl progression; we also report a 4-level fine-grained variant.
  • Evidence requirement: for every prediction, models must provide verifiable supporting evidence — text evidence (a quoted span or sentence indices from the transcript) and/or temporal evidence (a timestamp interval referencing visual cues).
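The relationship between the fine-grained and binary sophistication labels can be written down directly. The following is a minimal sketch, assuming integer levels 1 to 4; the function name `binary_sophistication` is illustrative and not taken from the benchmark code.

```python
def binary_sophistication(level: int) -> str:
    """Collapse a 4-level sophistication rating into the binary probe label.

    Levels 1-2 map to "Low" and Levels 3-4 map to "High", following the
    grouping described for the sophistication probe.
    """
    if level not in (1, 2, 3, 4):
        raise ValueError(f"level must be 1, 2, 3, or 4; got {level!r}")
    return "Low" if level <= 2 else "High"


if __name__ == "__main__":
    for lvl in (1, 2, 3, 4):
        print(lvl, binary_sophistication(lvl))
```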

Dataset Statistics

The final benchmark comprises 113 clips (~3 hours of classroom video), segmented from 82 NGSS-aligned lessons (16.35 hours raw) sourced from 120 K–12 science YouTube channels. Clips range from 6 to 600 seconds (mean: 92s, median: 63s). All clips were consensus-coded by two researchers (a doctoral student and a faculty member) trained in the CIP framework. Label distribution reflects naturalistic variation: BI (n=24, 21%), D1 (n=36, 32%), D2 (n=16, 14%), D3 (n=37, 33%). Transcripts were generated with Whisper large-v3 and lightly corrected; timestamps were preserved to support evidence localization.
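As a quick consistency check on these statistics, the label shares and the approximate total duration can be recomputed from the reported counts and mean clip length. This is only an arithmetic sketch using the numbers stated above; the variable names are illustrative.

```python
# Per-category clip counts as reported for the benchmark.
label_counts = {"BI": 24, "D1": 36, "D2": 16, "D3": 37}

total_clips = sum(label_counts.values())

# Share of each category, rounded to whole percentages.
label_shares = {
    label: round(100 * count / total_clips)
    for label, count in label_counts.items()
}

# Approximate total duration from the reported mean clip length (92 s).
approx_hours = total_clips * 92 / 3600

if __name__ == "__main__":
    print(total_clips, label_shares, round(approx_hours, 2))
```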


Figure 2: SciIBI benchmark construction. NGSS-aligned videos are temporally segmented and annotated by consensus. The 113 clips span four CIP categories and binary sophistication levels.

Method

Input Modalities

We compare two input configurations to isolate the contribution of visual information:

  • Text (T): transcript-only input.
  • Text + Vision (TV): transcript paired with sampled video frames (1 frame/second, capped at 30 frames per clip).
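The frame budget for the TV configuration can be sketched as follows. The source states only the sampling rate (1 frame/second) and the cap (30 frames per clip); how frames are selected for clips longer than 30 seconds is not specified, so the uniform subsampling below is an assumption, and `sample_frame_times` is a hypothetical helper name.

```python
def sample_frame_times(duration_s: float, fps: float = 1.0,
                       max_frames: int = 30) -> list[float]:
    """Return timestamps (in seconds) of frames to extract from one clip.

    Candidates are taken at `fps` frames per second. If there are more
    candidates than `max_frames`, they are subsampled uniformly across the
    clip (ASSUMPTION: the source does not say how the cap is enforced).
    """
    n_candidates = max(1, int(duration_s * fps))
    times = [i / fps for i in range(n_candidates)]
    if len(times) <= max_frames:
        return times
    step = len(times) / max_frames
    return [times[int(i * step)] for i in range(max_frames)]


if __name__ == "__main__":
    print(len(sample_frame_times(6)))    # shortest clip length in the benchmark
    print(len(sample_frame_times(600)))  # longest clip length in the benchmark
```

Under this assumption, a 600-second clip is represented by one frame every 20 seconds, which is worth keeping in mind when interpreting visual evidence for long clips.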

Prompting Strategies

All strategies share identical CIP definitions and require a structured JSON response with category, sophistication, and evidence fields. Outputs failing schema validation receive one retry; unparseable responses are excluded from accuracy computation.

  • Zero-shot: task instructions and CIP definitions only.
  • Few-shot: four in-context exemplars (one per category), held constant across models.
  • Chain-of-Thought (CoT): step-by-step reasoning before classification.
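The validation-and-retry policy described above might look like the sketch below. The field names `category`, `sophistication`, and `evidence` come from the text; the allowed values, the exact shape of the evidence field, and the `call_model` callable are assumptions made for illustration.

```python
import json
from typing import Callable, Optional

CATEGORIES = {"BI", "D1", "D2", "D3"}
SOPHISTICATION = {"Low", "High"}  # assumed values for the binary probe


def parse_response(raw: str) -> Optional[dict]:
    """Return the parsed response if it passes the schema check, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    category = obj.get("category")
    if not isinstance(category, str) or category not in CATEGORIES:
        return None
    level = obj.get("sophistication")
    if not isinstance(level, str) or level not in SOPHISTICATION:
        return None
    evidence = obj.get("evidence")
    # Evidence may be a quoted span, sentence indices, or a timestamp
    # interval; here we only require that it is present and non-empty.
    if not isinstance(evidence, (str, list, dict)) or not evidence:
        return None
    return obj


def classify_with_retry(call_model: Callable[[], str],
                        max_retries: int = 1) -> Optional[dict]:
    """Query the model, retrying once on invalid output.

    Returns None when every attempt fails, so the caller can exclude the
    clip from accuracy computation.
    """
    for _ in range(1 + max_retries):
        parsed = parse_response(call_model())
        if parsed is not None:
            return parsed
    return None
```

Returning `None` rather than a default label keeps unparseable responses out of the accuracy denominator, matching the exclusion rule stated above.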

Models

We evaluate eight models spanning scales and modalities. Open-source: Mistral-7B, Llama-3.3-70B, GPT-OSS-20B, and InternVL3-78B (the only open-source MLLM in our set). Proprietary (API): GPT-4o, Claude Sonnet 4.5, Gemini-2.5-Pro, and Qwen3-VL-235B. Open-source models are deployed locally with 4-bit quantization (BitsAndBytes) for models exceeding 40B parameters, on NVIDIA L40S GPUs. All models use deterministic decoding (temperature=0.0) with max_new_tokens=1024.
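The deployment rule for local models can be summarized in a small configuration sketch. The 40B quantization threshold, temperature, and token limit are taken from the text; the `RunConfig` class and `local_run_config` function are hypothetical names and do not correspond to any library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Illustrative container for per-model inference settings."""
    model: str
    params_b: float          # parameter count in billions
    load_in_4bit: bool       # whether 4-bit quantization is applied
    temperature: float = 0.0
    max_new_tokens: int = 1024


def local_run_config(model: str, params_b: float,
                     quant_threshold_b: float = 40.0) -> RunConfig:
    """Apply 4-bit quantization only to models exceeding the threshold."""
    return RunConfig(model=model, params_b=params_b,
                     load_in_4bit=params_b > quant_threshold_b)


if __name__ == "__main__":
    for name, size in [("Mistral-7B", 7), ("GPT-OSS-20B", 20),
                       ("Llama-3.3-70B", 70), ("InternVL3-78B", 78)]:
        print(local_run_config(name, size))
```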

Results

Classification Performance

Zero-shot accuracies range from 39.1% to 53.6%, substantially below the ~79% F1 reported for math discourse coding, which suggests that CIP coding requires instructional-function judgments beyond surface lexical cues. CoT benefits larger models (GPT-4o: +5.5pp; Llama-3.3-70B: +3.6pp) but degrades smaller ones (Mistral-7B: −5.4pp). Few-shot effects are modest and inconsistent: gains, where present, are about 1–3pp, while some configurations decline (e.g., Mistral-7B: −3.6pp).

Model               Size   Mod   Zero-shot      Few-shot       CoT
                                 Acc    F1      Acc    F1      Acc    F1
Proprietary (API)
GPT-4o              -      T     45.4   43.6    45.5   42.9    50.9   48.4
GPT-4o              -      TV    49.1   46.1    47.7   46.2    51.9   48.5
Claude Sonnet 4.5   -      T     43.8   41.2    45.5   43.8    46.4   44.4
Gemini-2.5-Pro      -      T     39.3   38.8    42.0   40.3    42.9   41.3
Gemini-2.5-Pro      -      TV    42.9   39.9    42.9   40.3    48.2   44.6
Qwen3-VL-235B       235B   T     46.4   40.8    48.2   43.5    45.5   41.7
Qwen3-VL-235B       235B   TV    45.5   42.3    47.3   43.2    49.1   43.3
Open-source (Local)
Mistral-7B          7B     T     40.2   33.2    36.6   31.3    34.8   30.3
GPT-OSS-20B         20B    T     39.1   37.4    37.7   35.5    36.6   35.6
Llama-3.3-70B       70B    T     44.6   42.5    47.3   44.8    48.2   45.4
InternVL3-78B       78B    T     46.4   43.6    47.3   44.0    47.3   45.3
InternVL3-78B       78B    TV    53.6   47.8    50.9   45.3    50.9   46.2

Table 2: Classification performance on SciIBI. Accuracy (Acc) and Macro-F1 (%) across eight models using Text-only (T) and Text+Vision (TV) inputs under Zero-shot, Few-shot, and CoT prompting. Bold indicates best within each column.
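For readers less familiar with the two metrics, the sketch below shows how accuracy and Macro-F1 are conventionally computed for the four-way task. It is a generic implementation on hypothetical toy labels, not the authors' evaluation script; treating the F1 of a class with no gold or predicted instances as 0 is an assumption.

```python
LABELS = ("BI", "D1", "D2", "D3")


def accuracy(gold: list[str], pred: list[str]) -> float:
    """Fraction of clips whose predicted practice matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: list[str], pred: list[str],
             labels: tuple[str, ...] = LABELS) -> float:
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # ASSUMPTION: a class that never appears contributes F1 = 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(labels)


if __name__ == "__main__":
    # Hypothetical toy labels, for illustration only.
    gold = ["BI", "D1", "D2", "D3", "D3", "D1"]
    pred = ["BI", "D3", "D2", "BI", "D3", "D1"]
    print(accuracy(gold, pred), macro_f1(gold, pred))
```

Because Macro-F1 weights the four practices equally, it penalizes weak performance on the smallest class (D2, n=16) more than accuracy does, which is one reason the two columns can diverge.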

Modality Ablation

Comparing text-only versus vision+text inputs, the best accuracy is 53.6% (InternVL3-78B, TV, zero-shot), a +7.1pp gain over text-only. However, gains are architecture-dependent: InternVL3 +7.1pp, GPT-4o +3.7pp, Gemini +3.6pp, while Qwen3-VL decreases by −0.9pp. Isolating vision within CoT yields +5.3pp for Gemini and +1.0pp for GPT-4o.

Model             Zero-shot      Few-shot       CoT
                  Acc    Δ       Acc    Δ       Acc    Δ
InternVL3-78B     53.6   +7.1    50.9   +4.5    50.9   +4.5
GPT-4o            49.1   +3.7    47.7   +2.3    51.9   +6.5
Gemini-2.5-Pro    42.9   +3.6    42.9   +3.6    48.2   +8.9
Qwen3-VL-235B     45.5   −0.9    47.3   +0.9    49.1   +2.7

Table 3: Impact of visual modality. Δ is the change from each model's text-only zero-shot baseline.
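The Δ values can be recomputed from the Text+Vision accuracies and the text-only zero-shot baselines in Table 2. The sketch below does this with the published, rounded figures; because the inputs are already rounded to one decimal, a recomputed delta may differ from the reported one by 0.1pp (for example, InternVL3-78B zero-shot). The dictionary names are illustrative.

```python
# Text-only zero-shot accuracy (%) per model, from Table 2.
text_zero_shot = {
    "InternVL3-78B": 46.4,
    "GPT-4o": 45.4,
    "Gemini-2.5-Pro": 39.3,
    "Qwen3-VL-235B": 46.4,
}

# Text+Vision accuracy (%) per model and prompting strategy, from Table 2.
text_vision = {
    "InternVL3-78B": {"zero_shot": 53.6, "few_shot": 50.9, "cot": 50.9},
    "GPT-4o": {"zero_shot": 49.1, "few_shot": 47.7, "cot": 51.9},
    "Gemini-2.5-Pro": {"zero_shot": 42.9, "few_shot": 42.9, "cot": 48.2},
    "Qwen3-VL-235B": {"zero_shot": 45.5, "few_shot": 47.3, "cot": 49.1},
}


def vision_deltas(tv: dict, baseline: dict) -> dict:
    """Change in accuracy (pp) relative to the text-only zero-shot baseline."""
    return {
        model: {k: round(acc - baseline[model], 1) for k, acc in runs.items()}
        for model, runs in tv.items()
    }


if __name__ == "__main__":
    for model, row in vision_deltas(text_vision, text_zero_shot).items():
        print(model, row)
```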

Sophistication Probe

InternVL3-78B (Text+Vision) achieves 39.1% overall (weighted) on the fine-grained 4-level sophistication task and 69.1% on the binary Low/High probe, with predictions skewing toward higher sophistication levels.

Condition     Metric       BI     D1     D2     D3
Ignore L1     4-Level      50.0   28.6   50.0   37.8
              Binary       66.7   54.3   85.7   78.4
Require L1    4-Level      45.5   34.8   50.0   33.3
              Binary       45.5   52.2   75.0   85.7
N             Ignore L1    24     35     14     37
              Require L1   11     23     4      21

Table 4: Sophistication level prediction (InternVL3-78B, Text+Vision). Accuracy (%) for fine-grained (4-Level) and binary (Low/High) tasks.
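The overall (weighted) figures quoted in the text can be reproduced from the per-practice accuracies and sample sizes in the "Ignore L1" rows. The sketch below assumes that "weighted" means weighting each practice by its number of clips; that reading is an assumption, though it agrees with the reported 39.1% and 69.1%.

```python
# "Ignore L1" rows of Table 4: clips per practice and accuracy (%).
n_clips = {"BI": 24, "D1": 35, "D2": 14, "D3": 37}
acc_4level = {"BI": 50.0, "D1": 28.6, "D2": 50.0, "D3": 37.8}
acc_binary = {"BI": 66.7, "D1": 54.3, "D2": 85.7, "D3": 78.4}


def weighted_accuracy(acc: dict, n: dict) -> float:
    """Sample-size-weighted mean of per-practice accuracies."""
    total = sum(n.values())
    return sum(acc[k] * n[k] for k in n) / total


if __name__ == "__main__":
    print(round(weighted_accuracy(acc_4level, n_clips), 1))
    print(round(weighted_accuracy(acc_binary, n_clips), 1))
```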

Evidence Quality Analysis

We designed an Evidence Quality Score (EQS) with three dimensions (1–3 scale): Alignment, Sufficiency, and Specificity. Two raters independently scored 60 model outputs (inter-rater κ: 0.73–0.92); disagreements were resolved by discussion. GPT-4o achieves higher EQS (2.67) than InternVL3 (2.40) despite lower accuracy, revealing that accuracy and evidence quality can diverge — correct predictions may cite superficial evidence (lexical shortcuts), while incorrect predictions sometimes provide well-grounded reasoning for genuinely ambiguous cases.
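Inter-rater agreement of this kind is commonly measured with Cohen's kappa, which corrects observed agreement for agreement expected by chance. The source reports κ without naming the variant, so the unweighted Cohen's kappa below is an assumption, and the ratings are hypothetical toy data rather than the study's scores.

```python
from collections import Counter


def cohen_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given identical scores.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal score frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used a single identical category
    return (p_observed - p_chance) / (1 - p_chance)


if __name__ == "__main__":
    # Hypothetical 1-3 scale ratings for eight outputs.
    a = [3, 3, 2, 1, 2, 3, 1, 2]
    b = [3, 3, 2, 1, 3, 3, 1, 2]
    print(round(cohen_kappa(a, b), 2))
```

For an ordinal 1–3 scale a weighted kappa is another reasonable choice, since it gives partial credit to near-misses; the unweighted form is shown here only because it is the simplest.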

Model            Alignment   Sufficiency   Specificity   Mean EQS
InternVL3-78B    2.53        2.02          2.65          2.40
GPT-4o           2.67        2.69          2.64          2.67

Table 5: Evidence Quality Scores (EQS, 1–3 scale). Higher accuracy does not always correlate with higher-quality reasoning.
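The Mean EQS column is consistent with an unweighted average of the three dimension scores. The sketch below checks this against the table; the equal weighting is inferred from the numbers rather than stated in the text, so treat it as an assumption.

```python
# Dimension scores (1-3 scale) from Table 5.
eqs_scores = {
    "InternVL3-78B": {"alignment": 2.53, "sufficiency": 2.02, "specificity": 2.65},
    "GPT-4o": {"alignment": 2.67, "sufficiency": 2.69, "specificity": 2.64},
}


def mean_eqs(dimensions: dict) -> float:
    """Unweighted mean of the dimension scores, rounded to two decimals."""
    return round(sum(dimensions.values()) / len(dimensions), 2)


if __name__ == "__main__":
    for model, dims in eqs_scores.items():
        print(model, mean_eqs(dims))
```

The gap between the two models is driven almost entirely by Sufficiency (2.02 vs 2.69), while Alignment and Specificity are close.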

Failure Analysis

The aggregated confusion matrix (zero-shot, text-only) reveals two dominant patterns: D3→BI (113 cases) and D1→D3 (84 cases). Both are caused by models matching surface linguistic features without distinguishing the underlying pedagogical function, suggesting that current LLMs cannot reliably separate what is said from why it is said.


Figure 3: Failure analysis of text-only models. (a) Aggregated confusion matrix reveals systematic D3→BI and D1→D3 confusions. (b) Representative errors show models rely on surface keywords rather than pedagogical function.
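An aggregated confusion matrix of this kind can be built by pooling (gold, predicted) pairs across models and then ranking the off-diagonal cells. The sketch below uses hypothetical toy predictions; the function names and data layout are illustrative and not taken from the benchmark code.

```python
from collections import Counter


def aggregate_confusions(runs: list[tuple[list[str], list[str]]]) -> Counter:
    """Pool (gold, predicted) label pairs over several model runs."""
    counts: Counter = Counter()
    for gold, pred in runs:
        for g, p in zip(gold, pred):
            counts[(g, p)] += 1
    return counts


def top_errors(counts: Counter, k: int = 2) -> list[tuple[tuple[str, str], int]]:
    """Return the k most frequent misclassifications as ((gold, pred), count)."""
    errors = [(pair, c) for pair, c in counts.items() if pair[0] != pair[1]]
    return sorted(errors, key=lambda item: -item[1])[:k]


if __name__ == "__main__":
    # Two hypothetical model runs over the same four clips.
    gold = ["D3", "D3", "D1", "BI"]
    runs = [
        (gold, ["BI", "D3", "D3", "BI"]),
        (gold, ["BI", "BI", "D1", "BI"]),
    ]
    for (g, p), count in top_errors(aggregate_confusions(runs)):
        print(f"{g}->{p}: {count}")
```

Pooling across models highlights confusions that are systematic rather than model-specific, which is what makes the D3→BI and D1→D3 patterns informative.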

Takeaways: (1) Visual input yields modest, architecture-dependent gains; not all MLLMs effectively integrate visual information for pedagogical reasoning. (2) Chain-of-thought helps large models but degrades smaller ones, suggesting step-by-step reasoning requires sufficient model capacity. (3) Evidence-based evaluation reveals cases where models succeed through lexical shortcuts rather than genuine understanding. We envision human-AI collaboration through evidence-first interfaces where model outputs serve as searchable annotations that augment expert judgment rather than replacing it.

BibTeX

@inproceedings{shen2026sciibi,
  title={Can Multimodal LLMs "See" Science Instruction? Benchmarking Pedagogical Reasoning in K-12 Classroom Videos},
  author={Shen, Yixuan and He, Peng and Liu, Honglu and Fan, Jinxuan and Ji, Yuyang and Li, Tingting and Chen, Tianlong and Xu, Kaidi and Liu, Feng},
  booktitle={International Conference on Artificial Intelligence in Education (AIED)},
  year={2026}
}