CVPR 2026

From 3D Pose to Prose: Biomechanics-Grounded
Vision-Language Coaching

¹Dept. of Computer Science, Drexel University  ²Dept. of Computer Science and Engineering, Michigan State University

Demo

BioCoach delivers real-time, biomechanically-grounded coaching feedback from fitness video, providing precise joint angle measurements and actionable corrections.

BioCoach Framework Teaser

BioCoach bridges biomechanical analysis and multimodal understanding for fitness coaching. Unlike pixel-only VLMs that produce generic advice, BioCoach constructs explicit, interpretable intermediate representations exposing kinematic properties to the language model, enabling phase-aligned, anatomically-specific feedback grounded in biomechanical principles.

Abstract

We present BioCoach, a biomechanics-grounded vision-language framework for coaching from fitness video. BioCoach fuses visual appearance and 3D skeletal kinematics through a novel three-stage pipeline: an exercise-specific degree-of-freedom selector that focuses analysis on salient joints; a structured biomechanical context module that pairs individualized morphometrics with cycle and constraint analysis; and a vision-biomechanics conditioned feedback module that applies cross-attention to generate precise, actionable text. Using parameter-efficient training that freezes the vision and language backbones, BioCoach yields transparent, personalized reasoning rather than pattern matching. To enable learning and fair evaluation, we augment QEVD-fit-coach with biomechanics-oriented feedback to create QEVD-bio-fit-coach, and we introduce a biomechanics-aware LLM judge metric. BioCoach delivers clear gains on QEVD-bio-fit-coach across lexical and judgment metrics while maintaining temporal triggering; on the original QEVD-fit-coach, it improves text quality and correctness with near-parity timing, demonstrating that explicit kinematics and constraints are key to accurate, phase-aware coaching.

Key Results

+262.8%
METEOR improvement on QEVD-bio-fit-coach
+89.5%
LLM-Bio-Accuracy improvement
+179.6%
ROUGE-L improvement on QEVD-bio-fit-coach
3-Stage
Interpretable biomechanics-grounded pipeline

Method Overview

BioCoach converts fitness videos into biomechanically-grounded coaching feedback through explicit, interpretable intermediate representations that bridge kinematic data and language generation. The framework extracts two complementary modalities—visual appearance and 3D skeletal kinematics—and processes them through a three-stage pipeline: (1) an Exercise-Specific DoF Selection Module that identifies anatomically salient joints; (2) a Structured Biomechanical Context Generation Module that analyzes motion quality while accounting for individual body geometry; and (3) a Vision-Biomechanics Conditioned Feedback Generation module that produces coaching grounded in explicit biomechanical analysis.
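
To make the first stage concrete, the sketch below shows the kind of kinematic feature a DoF selector could operate on: a joint flexion angle computed from 3D keypoints, filtered by an exercise-to-salient-joint prior. The joint names, the `DOF_PRIORS` table, and the `select_dofs` helper are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) formed by 3D points a-b-c."""
    v1 = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v2 = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Illustrative exercise-to-salient-joint prior (assumed for this sketch).
DOF_PRIORS = {
    "squat": ["knee", "hip", "ankle"],
    "push_up": ["elbow", "shoulder"],
}

def select_dofs(exercise, keypoints):
    """Return {joint: angle} for joints salient to this exercise.
    `keypoints` maps a joint name to its (parent, joint, child) 3D triple."""
    return {j: joint_angle(*keypoints[j]) for j in DOF_PRIORS[exercise]}

# Example: a knee bent at a right angle.
hip, knee, ankle = (0, 1, 0), (0, 0, 0), (1, 0, 0)
print(round(joint_angle(hip, knee, ankle)))  # 90
```

Exposing angles like these as explicit, named quantities is what lets the downstream language stage produce anatomically specific feedback ("deepen knee flexion") instead of generic advice.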

Results

Performance on QEVD-bio-fit-coach

Evaluation on the newly created benchmark with fine-grained biomechanical ground-truth feedback annotations.

| Method | METEOR ↑ | ROUGE-L ↑ | BERTScore ↑ | LLM-Acc. ↑ | LLM-Bio-Acc. ↑ | T-F-Score ↑ |
|---|---|---|---|---|---|---|
| Stream-VLM (NeurIPS '24) | 0.086 | 0.108 | 0.852 | 1.86 | 1.72 | 0.530 |
| BioCoach (Ours) | 0.312 (+262.8%) | 0.302 (+179.6%) | 0.877 (+2.9%) | 3.12 (+67.7%) | 3.26 (+89.5%) | 0.544 (+2.6%) |
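
The parenthesized gains are relative improvements over the Stream-VLM baseline; a quick check from the absolute scores:

```python
# Relative improvement of BioCoach over the Stream-VLM baseline,
# computed from the absolute scores in the table above.

def rel_gain(ours, baseline):
    return 100.0 * (ours - baseline) / baseline

print(f"METEOR:  {rel_gain(0.312, 0.086):+.1f}%")  # +262.8%
print(f"ROUGE-L: {rel_gain(0.302, 0.108):+.1f}%")  # +179.6%
print(f"LLM-Bio: {rel_gain(3.26, 1.72):+.1f}%")    # +89.5%
```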

Performance on QEVD-fit-coach (Original Annotations)

Performance comparison with state-of-the-art methods using original feedback annotations.

| Method | METEOR ↑ | ROUGE-L ↑ | BERTScore ↑ | LLM-Acc. ↑ | T-F-Score ↑ |
|---|---|---|---|---|---|
| *Zero-shot Models* | | | | | |
| InstructBLIP | 0.047 | 0.040 | 0.839 | 1.56 | - |
| Video-LLaVA | 0.057 | 0.025 | 0.847 | 2.16 | - |
| Video-ChatGPT | 0.098 | 0.078 | 0.850 | 1.91 | - |
| Video-LLaMA | 0.101 | 0.077 | 0.859 | 1.29 | - |
| LLaMA-VID | 0.100 | 0.079 | 0.859 | 2.20 | - |
| LLaVA-NeXT | 0.104 | 0.078 | 0.858 | 2.27 | - |
| *Fine-tuned Models* | | | | | |
| Socratic-LLaMA-2-7B | 0.094 | 0.071 | 0.860 | 2.17 | 0.50 |
| Video-ChatGPT (FT) | 0.108 | 0.093 | 0.863 | 2.33 | 0.50 |
| LLaMA-VID (FT) | 0.106 | 0.090 | 0.860 | 2.30 | 0.50 |
| Stream-VLM (NeurIPS '24) | 0.127 | 0.112 | 0.863 | 2.45 | 0.56 |
| BioCoach (Ours) | 0.129 (+1.6%) | 0.122 (+8.9%) | 0.864 (+0.1%) | 2.56 (+4.5%) | 0.544 |

BibTeX

@inproceedings{ji2026biocoach,
  title={From 3D Pose to Prose: Biomechanics-Grounded Vision-Language Coaching},
  author={Ji, Yuyang and Shen, Yixuan and Zhu, Shengjie and Kong, Yu and Liu, Feng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}