Xiaohan Wang

Bio

I am a Postdoc at Stanford University, affiliated with MARVL and Stanford AI Lab. I am fortunate to work with Prof. Serena Yeung.

My research interests lie in Video Understanding, Multimodal Learning, and AI for Healthcare.

I received my Ph.D. from the University of Technology Sydney, where I was advised by Prof. Yi Yang. Prior to that, I obtained my B.E. from the University of Science and Technology of China. During my Ph.D. studies, I had the privilege to collaborate with researchers at Baidu Research and Facebook AI Research.

News

[2025/01]: Video-STaR and VidDiff are accepted to ICLR 2025.
[2025/01]: Release Temporal Preference Optimization (TPO), a video-centric post-training framework that enhances temporal grounding in long-form videos for Video-LMMs.
[2024/12]: Release Apollo, a comprehensive exploration of video understanding in large multimodal models.
[2024/10]: VLM Classifier is accepted to NeurIPS 2024.
[2024/07]: VideoAgent is accepted to ECCV 2024.
[2024/06]: Give a talk at the "What is Next in Video Understanding" workshop @ CVPR 2024.
[2024/03]: Introduce VideoAgent, where we leverage a large language model as an agent for long-form video understanding.
[2024/03]: VisDiff is accepted as an oral presentation (90/11532) at CVPR 2024!
[2024/01]: RLCF is accepted by ICLR 2024.

Publications

Most recent publications on Google Scholar.
* indicates equal contribution.

Temporal Preference Optimization for Long-Form Video Understanding

Rui Li*, Xiaohan Wang*, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy

arXiv preprint (2025)

project paper code

@article{li2025temporal,
  title={Temporal Preference Optimization for Long-Form Video Understanding},
  author={Li, Rui and Wang, Xiaohan and Zhang, Yuhui and Wang, Zeyu and Yeung-Levy, Serena},
  journal={arXiv preprint arXiv:2501.13919},
  year={2025}
}

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, and Xide Xia

arXiv preprint (2024)

project paper

@article{zohar2024apollo,
  title={Apollo: An Exploration of Video Understanding in Large Multimodal Models},
  author={Zohar, Orr and Wang, Xiaohan and Dubois, Yann and Mehta, Nikhil and Xiao, Tong and Hansen-Estruch, Philippe and Yu, Licheng and Wang, Xiaofang and Juefei-Xu, Felix and Zhang, Ning and Yeung-Levy, Serena and Xia, Xide},
  journal={arXiv preprint arXiv:2412.10360},
  year={2024}
}

Video-STaR: Bootstrapping Weak Video Supervision for Visual Instruction Tuning

Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung-Levy

ICLR (2025)

project paper code

@article{zohar2024video,
  title={Video-star: Self-training enables video instruction tuning with any supervision},
  author={Zohar, Orr and Wang, Xiaohan and Bitton, Yonatan and Szpektor, Idan and Yeung-Levy, Serena},
  journal={arXiv preprint arXiv:2407.06189},
  year={2024}
}

Video Action Differencing

James Burgess, Xiaohan Wang, Yuhui Zhang, Anita Rau, Alejandro Lozano, Lisa Dunlap, Trevor Darrell, Serena Yeung-Levy

ICLR (2025)

project paper code

@inproceedings{burgess2025video,
  title={Video Action Differencing},
  author={Burgess, James and Wang, Xiaohan and Zhang, Yuhui and Rau, Anita and Lozano, Alejandro and Dunlap, Lisa and Darrell, Trevor and Yeung-Levy, Serena},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025}
}

Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration

Mark Endo, Xiaohan Wang, Serena Yeung-Levy

arXiv preprint (2024)

project paper

@article{endo2024feather,
  title={Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration},
  author={Endo, Mark and Wang, Xiaohan and Yeung-Levy, Serena},
  journal={arXiv preprint arXiv:2412.13180},
  year={2024}
}

VideoAgent: Long-form Video Understanding with Large Language Model as Agent

Xiaohan Wang*, Yuhui Zhang*, Orr Zohar, Serena Yeung-Levy

ECCV (2024)

project paper code

@article{VideoAgent,
  title={VideoAgent: Long-form Video Understanding with Large Language Model as Agent},
  author={Wang, Xiaohan and Zhang, Yuhui and Zohar, Orr and Yeung-Levy, Serena},
  journal={arXiv preprint arXiv:2403.10517},
  year={2024}
}

Why are Visually-Grounded Language Models Bad at Image Classification?

Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, Serena Yeung-Levy

NeurIPS (2024)

project paper code

@article{VLMClassifier,
  title={Why are Visually-Grounded Language Models Bad at Image Classification?},
  author={Zhang, Yuhui and Unell, Alyssa and Wang, Xiaohan and Ghosh, Dhruba and Su, Yuchang and Schmidt, Ludwig and Yeung-Levy, Serena},
  journal={arXiv preprint arXiv:2405.18415},
  year={2024}
}

Describing Differences in Image Sets with Natural Language

Lisa Dunlap*, Yuhui Zhang*, Xiaohan Wang, Ruiqi Zhong, Trevor Darrell*, Jacob Steinhardt*, Joseph E. Gonzalez*, Serena Yeung-Levy*

CVPR (2024) Oral (90/11532)

project paper code

@inproceedings{VisDiff,
  title={Describing Differences in Image Sets with Natural Language},
  author={Dunlap, Lisa and Zhang, Yuhui and Wang, Xiaohan and Zhong, Ruiqi and Darrell, Trevor and Steinhardt, Jacob and Gonzalez, Joseph E. and Yeung-Levy, Serena},
  booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}

Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models

Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang

ICLR (2024)

project paper code

@inproceedings{
  zhao2024testtime,
  title={Test-Time Adaptation with {CLIP} Reward for Zero-Shot Generalization in Vision-Language Models},
  author={Shuai Zhao and Xiaohan Wang and Linchao Zhu and Yi Yang},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=kIP0duasBb}
}

LANA: A Language-Capable Navigator for Instruction Following and Generation

Xiaohan Wang, Wenguan Wang, Jiayi Shao, Yi Yang

CVPR (2023)

paper code

@inproceedings{wang2023lana,
  title={Lana: A language-capable navigator for instruction following and generation},
  author={Wang, Xiaohan and Wang, Wenguan and Shao, Jiayi and Yang, Yi},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={19048--19058},
  year={2023}
}

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang

CVPR (2023)

paper code

@inproceedings{bike,
  title={Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models},
  author={Wu, Wenhao and Wang, Xiaohan and Luo, Haipeng and Wang, Jingdong and Yang, Yi and Ouyang, Wanli},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2023}
}

Gloss-Free End-to-End Sign Language Translation

Kezhou Lin, Xiaohan Wang, Linchao Zhu, Ke Sun, Bang Zhang, Yi Yang

ACL (2023) Oral

paper code

@inproceedings{lin2023gloss,
  title={Gloss-Free End-to-End Sign Language Translation},
  author={Lin, Kezhou and Wang, Xiaohan and Zhu, Linchao and Sun, Ke and Zhang, Bang and Yang, Yi},
  booktitle={The 61st Annual Meeting Of The Association For Computational Linguistics},
  year={2023}
}

Action Sensitivity Learning for Temporal Action Localization

Jiayi Shao, Xiaohan Wang, Ruijie Quan, Junjun Zheng, Jiang Yang, Yi Yang

ICCV (2023)

paper code

@InProceedings{Shao_2023_ICCV,
    author    = {Shao, Jiayi and Wang, Xiaohan and Quan, Ruijie and Zheng, Junjun and Yang, Jiang and Yang, Yi},
    title     = {Action Sensitivity Learning for Temporal Action Localization},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {13457--13469}
}

Large-Scale Video Panoptic Segmentation in the Wild: A Benchmark

Jiaxu Miao, Xiaohan Wang, Yu Wu, Wei Li, Xu Zhang, Yunchao Wei, Yi Yang

CVPR (2022)

paper code

@inproceedings{miao2022large,
  title={Large-scale Video Panoptic Segmentation in the Wild: A Benchmark},
  author={Miao, Jiaxu and Wang, Xiaohan and Wu, Yu and Li, Wei and Zhang, Xu and Wei, Yunchao and Yang, Yi},
  booktitle={Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition},
  year={2022}
}

Interactive Prototype Learning for Egocentric Action Recognition

Xiaohan Wang, Linchao Zhu, Heng Wang, Yi Yang

ICCV (2021)

paper

@inproceedings{wang2021interactive,
  title={Interactive prototype learning for egocentric action recognition},
  author={Wang, Xiaohan and Zhu, Linchao and Wang, Heng and Yang, Yi},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={8168--8177},
  year={2021}
}

Symbiotic Attention for Egocentric Action Recognition with Object-centric Alignment

Xiaohan Wang, Linchao Zhu, Yu Wu, Yi Yang

T-PAMI (2021)

paper code

@article{wang2020symbiotic,
  title={Symbiotic attention for egocentric action recognition with object-centric alignment},
  author={Wang, Xiaohan and Zhu, Linchao and Wu, Yu and Yang, Yi},
  journal={IEEE transactions on pattern analysis and machine intelligence},
  volume={45},
  number={6},
  pages={6605--6617},
  year={2020},
  publisher={IEEE}
}

T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval

Xiaohan Wang, Linchao Zhu, Yi Yang

CVPR (2021)

paper code

@inproceedings{wang2021t2vlad,
  title={T2vlad: global-local sequence alignment for text-video retrieval},
  author={Wang, Xiaohan and Zhu, Linchao and Yang, Yi},
  booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
  pages={5079--5088},
  year={2021}
}

Symbiotic Attention with Privileged Information for Egocentric Action Recognition

Xiaohan Wang, Yu Wu, Linchao Zhu, Yi Yang

AAAI (2020) Oral

paper code

@inproceedings{wang2020symbiotic,
  title={Symbiotic attention with privileged information for egocentric action recognition},
  author={Wang, Xiaohan and Wu, Yu and Zhu, Linchao and Yang, Yi},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={34},
  number={07},
  pages={12249--12256},
  year={2020}
}

Teaching

Co-instruct Advanced Topics in Computer Vision and Biomedicine (CS286/BIODS276) at Stanford
Guest Lecture Advanced Machine Learning (CS-5806) at Virginia Tech

Acknowledgements

This website uses the website design and template by Martin Saveski.