BioCoach delivers real-time, biomechanically grounded coaching feedback from fitness video, providing precise joint-angle measurements and actionable corrections.
We present BioCoach, a biomechanics-grounded vision-language framework for fitness coaching from video. BioCoach fuses visual appearance with 3D skeletal kinematics through a novel three-stage pipeline: an exercise-specific degree-of-freedom selector that focuses analysis on salient joints; a structured biomechanical context that pairs individualized morphometrics with cycle and constraint analysis; and a vision-biomechanics conditioned feedback module that applies cross-attention to generate precise, actionable text. Using parameter-efficient training that freezes the vision and language backbones, BioCoach yields transparent, personalized reasoning rather than pattern matching. To enable learning and fair evaluation, we augment QEVD-fit-coach with biomechanics-oriented feedback to create QEVD-bio-fit-coach, and we introduce a biomechanics-aware LLM judge metric. BioCoach delivers clear gains on QEVD-bio-fit-coach across lexical and judgment metrics while maintaining temporal triggering; on the original QEVD-fit-coach, it improves text quality and correctness with near-parity timing, demonstrating that explicit kinematics and constraints are key to accurate, phase-aware coaching.
BioCoach converts fitness videos into biomechanically grounded coaching feedback through explicit, interpretable intermediate representations that bridge kinematic data and language generation. The framework extracts two complementary modalities, visual appearance and 3D skeletal kinematics, and processes them through a three-stage pipeline: (1) an Exercise-Specific DoF Selection Module that identifies anatomically salient joints; (2) a Structured Biomechanical Context Generation Module that analyzes motion quality while accounting for individual body geometry; and (3) a Vision-Biomechanics Conditioned Feedback Generation Module that produces coaching grounded in explicit biomechanical analysis.
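To make the kinematic side of the pipeline concrete, the basic quantity the DoF selector reasons over is a joint angle computed from 3D keypoints. The snippet below is a minimal sketch (not BioCoach's actual code): `joint_angle` and the hip-knee-ankle example keypoints are illustrative assumptions, showing how a single degree of freedom such as knee flexion can be read off a 3D skeleton.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) formed by 3D keypoints a-b-c."""
    u = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip guards against floating-point drift outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Illustrative example: knee angle from hip-knee-ankle keypoints (meters).
hip, knee, ankle = [0.0, 1.0, 0.0], [0.0, 0.5, 0.1], [0.0, 0.0, 0.0]
knee_angle = joint_angle(hip, knee, ankle)  # ~157 deg: a nearly straight leg
```

Tracking such angles per frame yields the time series from which cycle boundaries and range-of-motion constraints can be derived.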
Evaluation on QEVD-bio-fit-coach, the newly created benchmark with fine-grained biomechanical ground-truth feedback annotations.
| Method | METEOR ↑ | ROUGE-L ↑ | BERTScore ↑ | LLM-Acc. ↑ | LLM-Bio-Acc. ↑ | T-F-Score ↑ |
|---|---|---|---|---|---|---|
| Stream-VLM (NeurIPS '24) | 0.086 | 0.108 | 0.852 | 1.86 | 1.72 | 0.530 |
| BioCoach (Ours) | 0.312 (+262.8%) | 0.302 (+179.6%) | 0.877 (+2.9%) | 3.12 (+67.7%) | 3.26 (+89.5%) | 0.544 (+2.6%) |
Performance comparison with state-of-the-art methods using original feedback annotations.
| Method | METEOR ↑ | ROUGE-L ↑ | BERTScore ↑ | LLM-Acc. ↑ | T-F-Score ↑ |
|---|---|---|---|---|---|
| *Zero-shot Models* | | | | | |
| InstructBLIP | 0.047 | 0.040 | 0.839 | 1.56 | - |
| Video-LLaVA | 0.057 | 0.025 | 0.847 | 2.16 | - |
| Video-ChatGPT | 0.098 | 0.078 | 0.850 | 1.91 | - |
| Video-LLaMA | 0.101 | 0.077 | 0.859 | 1.29 | - |
| LLaMA-VID | 0.100 | 0.079 | 0.859 | 2.20 | - |
| LLaVA-NeXT | 0.104 | 0.078 | 0.858 | 2.27 | - |
| *Fine-tuned Models* | | | | | |
| Socratic-LLaMA-2-7B | 0.094 | 0.071 | 0.860 | 2.17 | 0.50 |
| Video-ChatGPT (FT) | 0.108 | 0.093 | 0.863 | 2.33 | 0.50 |
| LLaMA-VID (FT) | 0.106 | 0.090 | 0.860 | 2.30 | 0.50 |
| Stream-VLM (NeurIPS '24) | 0.127 | 0.112 | 0.863 | 2.45 | 0.56 |
| BioCoach (Ours) | 0.129 (+1.6%) | 0.122 (+8.9%) | 0.864 (+0.1%) | 2.56 (+4.5%) | 0.544 (−2.9%) |
@inproceedings{ji2026biocoach,
title={From 3D Pose to Prose: Biomechanics-Grounded Vision-Language Coaching},
author={Ji, Yuyang and Shen, Yixuan and Zhu, Shengjie and Kong, Yu and Liu, Feng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}