I am a Postdoc at Stanford University, affiliated with MARVL and Stanford AI Lab. I am fortunate to work with Prof. Serena Yeung.
My research interests lie in Video Understanding, Multimodal Learning, and AI for Healthcare.
I received my Ph.D. from the University of Technology Sydney, where I was advised by Prof. Yi Yang. Prior to that, I obtained my B.E. from the University of Science and Technology of China. During my Ph.D. studies, I had the privilege of collaborating with researchers at Baidu Research and Facebook AI Research.
My most recent publications are listed on Google Scholar.
* indicates equal contribution.
Temporal Preference Optimization for Long-Form Video Understanding
Rui Li*, Xiaohan Wang*, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy
arXiv preprint (2025)
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia
arXiv preprint (2024)
Video-STaR: Bootstrapping Weak Video Supervision for Visual Instruction Tuning
Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung-Levy
ICLR (2025)
Video Action Differencing
James Burgess, Xiaohan Wang, Yuhui Zhang, Anita Rau, Alejandro Lozano, Lisa Dunlap, Trevor Darrell, Serena Yeung-Levy
ICLR (2025)
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
Mark Endo, Xiaohan Wang, Serena Yeung-Levy
arXiv preprint (2024)
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Xiaohan Wang*, Yuhui Zhang*, Orr Zohar, Serena Yeung-Levy
ECCV (2024)
Why are Visually-Grounded Language Models Bad at Image Classification?
Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, Serena Yeung-Levy
NeurIPS (2024)
Describing Differences in Image Sets with Natural Language
Lisa Dunlap*, Yuhui Zhang*, Xiaohan Wang, Ruiqi Zhong, Trevor Darrell*, Jacob Steinhardt*, Joseph E. Gonzalez*, Serena Yeung-Levy*
CVPR (2024) Oral (90/11532)
Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models
Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang
ICLR (2024)
LANA: A Language-Capable Navigator for Instruction Following and Generation
Xiaohan Wang, Wenguan Wang, Jiayi Shao, Yi Yang
CVPR (2023)
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang
CVPR (2023)
Gloss-Free End-to-End Sign Language Translation
Kezhou Lin, Xiaohan Wang, Linchao Zhu, Ke Sun, Bang Zhang, Yi Yang
ACL (2023) Oral
Action Sensitivity Learning for Temporal Action Localization
Jiayi Shao, Xiaohan Wang, Ruijie Quan, Junjun Zheng, Jiang Yang, Yi Yang
ICCV (2023)
Large-Scale Video Panoptic Segmentation in the Wild: A Benchmark
Jiaxu Miao, Xiaohan Wang, Yu Wu, Wei Li, Xu Zhang, Yunchao Wei, Yi Yang
CVPR (2022)
Interactive Prototype Learning for Egocentric Action Recognition
Xiaohan Wang, Linchao Zhu, Heng Wang, Yi Yang
ICCV (2021)
Symbiotic Attention for Egocentric Action Recognition with Object-centric Alignment
Xiaohan Wang, Linchao Zhu, Yu Wu, Yi Yang
T-PAMI (2021)
T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval
Xiaohan Wang, Linchao Zhu, Yi Yang
CVPR (2021)
Symbiotic Attention with Privileged Information for Egocentric Action Recognition
Xiaohan Wang, Yu Wu, Linchao Zhu, Yi Yang
AAAI (2020) Oral
This website uses the design and template by Martin Saveski.