DrawSim-PD: Simulating Student Science Drawings to Support NGSS-Aligned Teacher Diagnostic Reasoning

Arijit Chakma1, Peng He3, Zeyuan Wang3, Honglu Liu2,3, Tingting Li3,
Tiffany D. Do1, and Feng Liu1

1Department of Computer Science, Drexel University
2College of Chemistry, Beijing Normal University
3Department of Teaching and Learning, Washington State University

DrawSim-PD teaser figure

Abstract

Developing expertise in diagnostic reasoning requires practice with diverse student artifacts, yet privacy regulations prohibit sharing authentic student work for teacher professional development (PD) at scale. We present DrawSim-PD, the first generative framework that simulates NGSS-aligned, student-like science drawings exhibiting controllable pedagogical imperfections to support teacher training. Central to our approach are capability profiles—structured cognitive states encoding what students at each performance level can and cannot yet demonstrate. These profiles ensure cross-modal coherence across generated outputs: (i) a student-like drawing, (ii) a first-person reasoning narrative, and (iii) a teacher-facing diagnostic concept map. Using 100 curated NGSS topics spanning K-12, we construct a corpus of 10,000 systematically structured artifacts. Through an expert-based feasibility evaluation, K-12 science educators verified the artifacts' alignment with NGSS expectations (>84% positive on core items) and utility for interpreting student thinking, while identifying refinement opportunities for grade-band extremes. We release this open infrastructure to overcome data scarcity barriers in visual assessment research.

Our contributions

  • We introduce a capability-profile mechanism that enables the generation of student-like drawings with systematically varied misconceptions, achieving controllable pedagogical imperfection aligned to curriculum standards.
  • We devise an automated diagnostic scaffolding module that transforms visual artifacts into structured teacher supports via generated diagnostic concept maps.
  • We release a 10,000-artifact corpus with structured metadata as open research infrastructure, representing the largest collection of curriculum-aligned student drawing simulations to date.
  • We validate the system's pedagogical fidelity through an expert feasibility study, where experienced educators confirmed that the generated outputs are NGSS-aligned and pedagogically authentic.

DrawSim-PD Framework

We present DrawSim-PD, a generative framework that simulates student-like science drawings accompanied by reasoning narratives and teacher-facing diagnostic concept maps. The framework addresses two core challenges: (1) producing synthetic artifacts that maintain both scientific validity and developmental authenticity (the "inverse problem" of generating specific errors), and (2) providing diagnostic scaffolding to support teacher interpretation and calibration activities.

Framework overview

DrawSim-PD comprises three integrated modules coordinated through shared capability profiles. The framework takes as input an NGSS performance expectation, a target grade level, and a desired performance level. It generates three outputs:

  1. A first-person reasoning narrative simulating the student's internal monologue and scientific vocabulary.
  2. A hand-drawn style scientific illustration reflecting grade-appropriate motor skills and specific, realistic misconceptions.
  3. A structured diagnostic concept map linking visual observations to underlying understanding, serving as an answer key for teacher diagnosis.
DrawSim-PD framework modules diagram
The DrawSim-PD framework comprises three modules: (1) NGSS-Aligned Student Simulator, (2) Drawing-Centric Synthesis, and (3) Diagnostic Concept Mapping.
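The input/output contract described above can be sketched as a minimal interface. All names here (`SimRequest`, `SimArtifacts`, `simulate`) are hypothetical stand-ins for illustration, not the released API:

```python
from dataclasses import dataclass

# Hypothetical interface sketch for the pipeline described above.
# Names and fields are assumptions, not the authors' actual code.

@dataclass(frozen=True)
class SimRequest:
    performance_expectation: str  # NGSS PE code, e.g. "5-LS2-1"
    grade: int                    # target grade (0 for kindergarten)
    performance_level: str        # Emergent / Developing / Proficient / Advanced

@dataclass
class SimArtifacts:
    narrative: str     # (1) first-person reasoning narrative
    drawing: bytes     # (2) hand-drawn style illustration (image bytes)
    concept_map: dict  # (3) diagnostic concept map as a node/edge graph

def simulate(req: SimRequest) -> SimArtifacts:
    # Stub standing in for the three modules; a real call would invoke
    # the simulator, synthesis, and mapping stages in turn.
    return SimArtifacts(
        narrative=f"I drew this for {req.performance_expectation}...",
        drawing=b"",
        concept_map={"nodes": [], "edges": []},
    )

artifacts = simulate(SimRequest("5-LS2-1", 5, "Developing"))
```

The key design point is that one request yields all three artifacts together, rather than three independent generations.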

Challenges and approach

Simulating pedagogically valid student drawings requires addressing three interconnected challenges:

  • Controllable misconceptions (Challenge 1): Visual misconceptions must manifest through specific spatial arrangements, missing elements, and incorrect relationships, requiring structured control rather than stochastic perturbation.
  • Cross-modal coherence (Challenge 2): A simulated student who "doesn't understand cyclical processes" must produce drawings lacking return arrows, narratives expressing confusion, and concept maps identifying this gap; independent generation risks hallucinating contradictory competencies. DrawSim-PD addresses this through unified generation conditioned on shared capability profiles.
  • Curriculum grounding (Challenge 3): Misconceptions must align with documented learning progressions, not arbitrary errors. DrawSim-PD addresses this through automated NGSS decomposition into evidence statements and capability profiles.

DrawSim-PD addresses these challenges through capability profiles as simulated cognitive states that ensure consistency across the reasoning narrative, the drawing, and the diagnostic concept map.
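The shared-conditioning idea can be sketched as a single structured profile rendered into the prompt for each of the three generators, so that the same gap (e.g., missing return arrows for a cyclical process) surfaces in every modality. Field and function names below are assumptions for illustration, not the paper's schema:

```python
from dataclasses import dataclass, field

# Illustrative sketch of a capability profile as a shared cognitive state.
# Field names are assumptions, not the paper's actual schema.

@dataclass
class CapabilityProfile:
    pe: str                     # NGSS performance expectation code
    grade_band: str             # e.g. "6-8"
    level: str                  # e.g. "Developing"
    can: list = field(default_factory=list)         # demonstrated capabilities
    cannot_yet: list = field(default_factory=list)  # targeted gaps/misconceptions

def condition_prompts(profile: CapabilityProfile) -> dict:
    """Render ONE profile into prompts for all three modules, so the same
    gap constrains the drawing, the narrative, and the concept map."""
    gaps = "; ".join(profile.cannot_yet)
    cans = "; ".join(profile.can)
    return {
        "narrative": f"As a grade {profile.grade_band} student at the "
                     f"{profile.level} level, explain your drawing. "
                     f"You can: {cans}. You cannot yet: {gaps}.",
        "drawing":   f"Hand-drawn style with grade {profile.grade_band} motor "
                     f"skills. Omit or distort elements tied to: {gaps}.",
        "concept_map": f"Diagnostic map for {profile.pe}; mark as gaps: {gaps}.",
    }

profile = CapabilityProfile(
    pe="MS-LS2-3",
    grade_band="6-8",
    level="Developing",
    can=["identify producers and consumers"],
    cannot_yet=["represent cyclical matter flow (return arrows)"],
)
prompts = condition_prompts(profile)
```

Because every generator reads the same `cannot_yet` list, none of them can hallucinate a competency the others deny, which is the coherence property Challenge 2 requires.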

Examples

Representative student-like scientific illustrations from DrawSim-PD
Representative student-like scientific illustrations generated by DrawSim-PD across four performance levels (Emergent to Advanced) and three science domains (life sciences, physical sciences, earth and space sciences).

Results

Participants evaluated 480 unique artifacts, providing systematic coverage of the corpus. Results exceeded expectations for a first-generation system (see table below).

Expert Evaluation Instrument

The mixed-method survey used for the feasibility study, assessing strict NGSS alignment (Q1–Q5), developmental plausibility (Q6), and pedagogical utility (Q7–Q8).

NGSS Alignment (Yes / Partially / No)
  • Q1: Does the topic align with the NGSS Performance Expectation?
  • Q2: Does the drawing represent the disciplinary core ideas?
  • Q3: Does the drawing align logically with the given prompt?
  • Q4: Does the drawing align with the capability statements?
  • Q5: Does the drawing match the assigned performance level?
Grade-Band Plausibility (Yes / Partially / No)
  • Q6: Does the drawing appear plausible for the target grade band?
Component Quality (1–5 Likert)
  • Q7: Does the concept map represent the reasoning in the drawing?
  • Q8: Does the work maintain plausible scientific relationships for the level?

NGSS Alignment

Artifacts demonstrated strong alignment with NGSS standards. Topic–PE alignment achieved 89.6% full agreement (Q1), indicating successful mapping of standards to drawing tasks. Disciplinary Core Idea representation (Q2: 84.2%) and drawing–prompt coherence (Q3: 86.7%) confirmed that generated content captures required scientific concepts. Notably, explicit disagreements ("No") remained below 2.1% across these core alignment items.

Performance Differentiation

Alignment with capability statements (Q4: 75.0% Yes) and performance levels (Q5: 73.8% Yes) showed that combined positive responses (Yes + Partially) exceeded 92% for both items. The approximate 20% prevalence of "Partially" ratings is attributed to inherent ambiguity in boundary cases.

Grade-Band Plausibility

For Q6, 93.3% of artifacts were rated as plausible or partially plausible for simulating developmental characteristics, suggesting the system encodes key developmental constraints. Plausibility was highest for middle grades (3–8), with slight degradation at grade-band extremes (K–2 and 9–12); see below for details.

Component Quality

Diagnostic concept maps (Q7) received favorable ratings, with 77.5% rated 4 or 5. Scientific plausibility (Q8) showed higher variance, with 18.3% rated 1 or 2. We interpret this variance as a deliberate design tension: the system intentionally generates incorrect elements to simulate misconceptions, and some evaluators may have penalized these as scientific inaccuracies.

Expert Evaluation Results (N=480 evaluations)

Yes / Partially / No items:

Dimension                        Yes      Partially  No
Q1: Topic-PE Alignment           89.58%   8.75%      1.67%
Q2: DCI Representation           84.17%   13.75%     2.08%
Q3: Drawing-Prompt Coherence     86.66%   12.92%     0.42%
Q4: Capability Statement Match   75.00%   24.17%     0.83%
Q5: Performance Level Match      73.75%   19.17%     7.08%
Q6: Grade-Level Authenticity     60.42%   32.92%     6.66%

Likert items (1-5):

Dimension                    1      2       3       4       5
Q7: Concept Map Quality      0.0%   3.3%    19.2%   47.9%   29.6%
Q8: Scientific Accuracy      5.0%   13.3%   19.6%   32.1%   30.0%
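Percentages of this kind reduce to frequency counts over the 480 evaluations. A minimal sketch, using toy ratings chosen to reproduce the Q1 row (the list below is illustrative, not the study's raw responses):

```python
from collections import Counter

# Sketch of tallying categorical ratings into row percentages.
# The toy ratings are illustrative, not the study's actual data.

def rating_percentages(ratings, categories=("Yes", "Partially", "No")):
    """Return each category's share of all ratings, in percent (2 d.p.)."""
    counts = Counter(ratings)
    n = len(ratings)
    return {c: round(100 * counts[c] / n, 2) for c in categories}

toy = ["Yes"] * 430 + ["Partially"] * 42 + ["No"] * 8  # 480 toy ratings
pcts = rating_percentages(toy)

# Combined positive rate (Yes + Partially), as reported in the text.
combined_positive = round(pcts["Yes"] + pcts["Partially"], 2)
```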

To complement expert judgments, we examined semantic alignment between generated components using CLIP similarity scores across 1,200 sampled artifacts.

Cross-modal consistency (CLIP similarity, N=1,200)

Condition        Text-Draw  CMap-Draw  Text-CMap  Overall
Overall          0.356      0.606      0.273      0.412
By Level
  Emergent       0.362      0.589      0.250      0.400
  Developing     0.360      0.607      0.266      0.411
  Proficient     0.354      0.614      0.280      0.416
  Advanced       0.349      0.614      0.293      0.419
By Grade
  K-2            0.364      0.583      0.263      0.403
  9-12           0.348      0.629      0.286      0.421

CLIP similarity serves as an engineering consistency diagnostic, not a proxy for educational validity; unrelated image–text pairs typically score below 0.15. Concept Map–Drawing alignment achieved the highest consistency (0.606), indicating that structured maps effectively capture visual content. Text–Drawing consistency decreased slightly with grade level (0.364 to 0.348) and performance level (0.362 to 0.349). We interpret this pattern not as system degradation but as reflecting a fundamental characteristic of science assessment: as reasoning becomes more advanced (Level 4) and abstract (Grades 9–12), it becomes increasingly difficult to represent via static visual depictions.
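The diagnostic itself reduces to cosine similarity between embedding vectors. A minimal sketch, where short toy vectors stand in for the embeddings a real pipeline would obtain from a CLIP image/text encoder:

```python
import math

# Sketch of the cross-modal consistency check: cosine similarity between
# modality embeddings. The 4-dimensional toy vectors below merely stand in
# for real CLIP embeddings (typically 512- or 768-dimensional).

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

drawing_emb = [0.2, 0.9, 0.1, 0.4]      # stand-in for the drawing's embedding
cmap_emb = [0.25, 0.8, 0.05, 0.5]       # concept-map text, close to the drawing
unrelated_emb = [0.9, -0.1, 0.8, -0.6]  # unrelated text, far from the drawing

related_score = cosine_similarity(drawing_emb, cmap_emb)
unrelated_score = cosine_similarity(drawing_emb, unrelated_emb)
```

Related pairs score high while unrelated pairs land near (or below) zero, which is why scores like 0.606 sit well clear of the sub-0.15 range typical of unrelated image-text pairs.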

Teacher Perspectives on Utility

We conducted a qualitative thematic analysis using inductive techniques to examine how expert participants perceived the generated corpus. To ensure rigor, two authors collaboratively coded the open-ended responses and refined categories through iterative discussion, following established guidelines for reliability in qualitative research. Five key themes emerged regarding system utility, authenticity, and design implications.

Plausibility Crossed the Authenticity Threshold

Participants consistently described outputs as immediately recognizable as student-like due to their simplicity and visual style. P6 noted, "This looks like the work of a real student," expressing confidence that the system could reflect classroom realities. However, authenticity was sometimes compromised by an "uncanny valley" of neatness: P1 explained, "The composition is too clear… the student's arrangement would be more chaotic." Participants stressed that true authenticity requires balancing scientific errors with stylistic imperfections, noting that when the drawing skill level matches the grade band, "it is more like real student thinking" (P5).

Grade-level Differentiation Succeeded for Core Grades

The system's strength lay in stratifying work by grade level, particularly for grades 3–8. Prompt-engineering challenges and participant observations (P5, P6) highlighted an underestimation of lower-grade students' capabilities, and high school examples often relied on scientific errors alone for differentiation. Experts recommended tighter constraints in capability profiles for K–2 motor skills and 9–12 abstract reasoning.

Misconceptions were Pedagogically Relevant but "Mechanical"

Artifacts reflected classroom misunderstandings—superficial reasoning, missing causal links—that teachers found useful. P2 distinguished: "AI's misconceptions appear mechanical and rigid… students' misconceptions are slightly more flexible." This finding is important for AIED researchers: the system captures what students get wrong but not fully how they get it wrong (the fluid, tentative nature of student thinking).

Cross-modal Integration Enhanced Credibility

Teachers valued the combined presentation of drawings, narratives, and concept maps; this multimodal structure made outputs more credible and easier to use as teaching artifacts. P4 described the components as generally consistent, and P6 found no significant discrepancies. Although some mismatches were noted (e.g., concept map vocabulary absent from the drawing), the overall consensus was that the diagnostic map successfully externalized the reasoning implied by the drawing.

Utility Depends on Transparency and Diversity

Participants identified strong potential for professional development and appreciated that artifacts articulate what students can and cannot do (P5). Experts recommended moving beyond black-box generation to maximize classroom relevance. P5 suggested adding a module to explain the logic of level determination, arguing that transparency in classification (e.g., why a drawing was Developing) would help teachers build diagnostic criteria. This feedback points toward a design shift from purely generative tools to explainable diagnostic trainers.

Conclusion

We presented DrawSim-PD, a generative framework that simulates student-like science drawings to support NGSS-aligned teacher diagnostic reasoning. By inverting standard generation objectives to prioritize pedagogical imperfection over aesthetic accuracy, we successfully modeled the partial understandings and spatial errors characteristic of developing learners. Central to our approach are capability profiles that encode performance constraints, ensuring cross-modal coherence between drawings, narratives, and diagnostic concept maps. In an expert-based feasibility study, K-12 science educators verified that the generated artifacts are well-aligned with NGSS expectations and appropriately differentiated by performance level, while identifying opportunities for refining grade-band extremes. The accompanying corpus of 10,000 artifacts provides open research infrastructure for calibration activities and visual assessment research previously constrained by privacy regulations. Beyond validation, we see potential for adaptive calibration systems targeting individual teachers' diagnostic blind spots and interactive generation for on-demand misconception specification.

BibTeX

@inproceedings{chakma2026drawsim,
  title={DrawSim-PD: Simulating Student Science Drawings to Support NGSS-Aligned Teacher Diagnostic Reasoning},
  author={Chakma, Arijit and He, Peng and Wang, Zeyuan and Liu, Honglu and Li, Tingting and Do, Tiffany D. and Liu, Feng},
  booktitle={Proceedings of the International Conference on Artificial Intelligence in Education (AIED)},
  year={2026}
}