BioCoach delivers real-time, biomechanically grounded coaching feedback from fitness video, providing precise joint-angle measurements and actionable corrections.
We present BioCoach, a biomechanics-grounded vision-language framework for fitness coaching from video. BioCoach fuses visual appearance with 3D skeletal kinematics through a novel three-stage pipeline: an exercise-specific degree-of-freedom selector that focuses analysis on salient joints; a structured biomechanical context that pairs individualized morphometrics with cycle and constraint analysis; and a vision-biomechanics conditioned feedback module that applies cross-attention to generate precise, actionable text. Using parameter-efficient training that freezes the vision and language backbones, BioCoach yields transparent, personalized reasoning rather than pattern matching. To enable learning and fair evaluation, we augment QEVD-fit-coach with biomechanics-oriented feedback to create QEVD-bio-fit-coach, and we introduce a biomechanics-aware LLM judge metric. BioCoach delivers clear gains on QEVD-bio-fit-coach across lexical and judgment metrics while maintaining temporal triggering; on the original QEVD-fit-coach, it improves text quality and correctness with near-parity timing, demonstrating that explicit kinematics and constraints are key to accurate, phase-aware coaching.
BioCoach converts fitness videos into biomechanically grounded coaching feedback through explicit, interpretable intermediate representations that bridge kinematic data and language generation. The framework extracts two complementary modalities, visual appearance and 3D skeletal kinematics, and processes them through a three-stage pipeline: (1) an Exercise-Specific DoF Selection Module that identifies anatomically salient joints; (2) a Structured Biomechanical Context Generation Module that analyzes motion quality while accounting for individual body geometry; and (3) a Vision-Biomechanics Conditioned Feedback Generation Module that produces coaching grounded in explicit biomechanical analysis.
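To make the kinematic side of the pipeline concrete, the basic quantity the DoF selector reasons over is a joint angle computed from 3D keypoints. The snippet below is a minimal sketch (not BioCoach's actual code): `joint_angle` and the hip-knee-ankle example keypoints are illustrative assumptions, showing how a single degree of freedom such as knee flexion can be read off a 3D skeleton.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) formed by 3D keypoints a-b-c."""
    u = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip guards against floating-point drift outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Illustrative example: knee angle from hip-knee-ankle keypoints (meters).
hip, knee, ankle = [0.0, 1.0, 0.0], [0.0, 0.5, 0.1], [0.0, 0.0, 0.0]
knee_angle = joint_angle(hip, knee, ankle)  # ~157 deg: a nearly straight leg
```

Tracking such angles per frame yields the time series from which cycle boundaries and range-of-motion constraints can be derived.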
Evaluation on QEVD-bio-fit-coach, the newly created benchmark with fine-grained biomechanical ground-truth feedback annotations.
| Method | METEOR ↑ | ROUGE-L ↑ | BERTScore ↑ | LLM-Acc. ↑ | LLM-Bio-Acc. ↑ | T-F-Score ↑ |
|---|---|---|---|---|---|---|
| Stream-VLM (NeurIPS '24) | 0.086 | 0.108 | 0.852 | 1.86 | 1.72 | 0.530 |
| BioCoach (Ours) | 0.312 (+262.8%) | 0.302 (+179.6%) | 0.877 (+2.9%) | 3.12 (+67.7%) | 3.26 (+89.5%) | 0.544 (+2.6%) |
Performance comparison with state-of-the-art methods using original feedback annotations.
| Method | METEOR ↑ | ROUGE-L ↑ | BERTScore ↑ | LLM-Acc. ↑ | T-F-Score ↑ |
|---|---|---|---|---|---|
| *Zero-shot Models* | | | | | |
| InstructBLIP | 0.047 | 0.040 | 0.839 | 1.56 | - |
| Video-LLaVA | 0.057 | 0.025 | 0.847 | 2.16 | - |
| Video-ChatGPT | 0.098 | 0.078 | 0.850 | 1.91 | - |
| Video-LLaMA | 0.101 | 0.077 | 0.859 | 1.29 | - |
| LLaMA-VID | 0.100 | 0.079 | 0.859 | 2.20 | - |
| LLaVA-NeXT | 0.104 | 0.078 | 0.858 | 2.27 | - |
| *Fine-tuned Models* | | | | | |
| Socratic-LLaMA-2-7B | 0.094 | 0.071 | 0.860 | 2.17 | 0.50 |
| Video-ChatGPT (FT) | 0.108 | 0.093 | 0.863 | 2.33 | 0.50 |
| LLaMA-VID (FT) | 0.106 | 0.090 | 0.860 | 2.30 | 0.50 |
| Stream-VLM (NeurIPS '24) | 0.127 | 0.112 | 0.863 | 2.45 | 0.56 |
| BioCoach (Ours) | 0.129 (+1.6%) | 0.122 (+8.9%) | 0.864 (+0.1%) | 2.56 (+4.5%) | 0.544 (−2.9%) |
@inproceedings{ji2026biocoach,
title={From 3D Pose to Prose: Biomechanics-Grounded Vision-Language Coaching},
author={Ji, Yuyang and Shen, Yixuan and Zhu, Shengjie and Kong, Yu and Liu, Feng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}