Invited Speakers

Schedule [Live Session Landing Page]

9:00-9:10 Opening Remarks [Live] Workshop Organizers
9:10-9:50 Invited Talk 1 + Live Q&A
Compositional Models for Open-Set Activity Classification of Videos
Alan Yuille
9:50-10:30 Invited Talk 2 + Live Q&A
Efficient Video Recognition
Christoph Feichtenhofer
10:30-10:40 Coffee Break
10:40-11:20 Invited Short Oral (workshop paper session)
11:20-14:00 Lunch
14:00-14:40 Invited Talk 3 + Live Q&A
Image and Video Captioning
Qin Jin
14:40-15:10 Invited Talk 4
Vision and Language: From Perception to Creation
Tao Mei
15:10-15:30 Coffee Break
15:30-15:40 YouMakeup Challenge Shizhe Chen [Slides]
15:40-16:00 Challenge Talk + Live Q&A
Baseline Introduction and Analysis
Ludan Ruan, Linli Yao, Weiying Wang
16:00-16:20 Challenge Talk (winner)
Multi-modal Feature Fusion for YouMakeup Video Question Answering

BUPTMM Submission for YouMakeUp VQA Challenge in Step Ordering Task

Yajie Zhang, Li Su [report]
Kun Liu, Huadong Ma [report]
16:20-16:30 VATEX Captioning Challenge [Live] Xin Wang
16:30-16:50 Challenge Talk (winner) + Live Q&A
Multi-View Features and Hybrid Reward Strategies for Video Captioning
Xinxin Zhu, Longteng Guo, Peng Yao, Shichen Lu, Wei Liu, Jing Liu
16:50-17:10 Challenge Talk (runner-up) + Live Q&A
Multi-modal Feature Fusion with Feature Attention for Video Captioning
Ke Lin, Zhuoxin Gan, Liwei Wang
17:10-17:15 Ending [Live] Workshop Organizers

Overview and Call For Papers

Vision and language is a recently emerged research area that has received a lot of attention. Initial research and applications in this area were mainly image-focused, such as Image Captioning, Visual Question Answering, and Referring Expression. However, moving beyond static images is essential for vision and language understanding, as videos contain much richer information such as spatial-temporal dynamics and audio signals. Most recently, researchers in both the computer vision and natural language processing communities have been striving to bridge videos and natural language. Popular topics such as video captioning, video question answering, and text-guided video generation fall into this area. We are proposing the first Language & Vision with applications to Video Understanding workshop at CVPR, with a joint VATEX Video Captioning Challenge and a YouMakeup Video Question Answering Challenge. This workshop aims to gather researchers from multiple domains to form a new video-language community and attract more people to this topic. In the workshop, we will invite several top-tier researchers from this area to present their most recent work. We will cover different video-language topics such as video captioning and video question answering. The invited speakers will present key architectural building blocks and novel algorithms used to solve these tasks.

This workshop covers (but is not limited to) the following topics:

In addition, we call for 10-15 high-quality 4-page extended abstracts to be showcased at a poster session along with short spotlight talks. Abstracts are not archival and will not be included in the Proceedings of CVPR 2020. In the interest of fostering a freer exchange of ideas, we welcome both novel and previously published work.

Submission details

This track follows the CVPR paper format. Submissions may consist of up to 4 pages of content in CVPR format, plus unlimited references. We are also accepting full submissions, which will not be included in the Proceedings of CVPR 2020; at the option of the authors, we will provide a link to the relevant arXiv submission. The submission should be emailed as a single PDF to the

The format of papers submitted to the archival track must follow the CVPR Author Guidelines. Style sheets (LaTeX, Word) are available here.

Important Dates


VATEX Captioning Challenge 2020

The VATEX Captioning Challenge 2020 aims to benchmark progress towards models that can describe videos in various languages such as English and Chinese. This year, in addition to the original 34,991 videos, we release a private test set with 6,278 new videos for evaluation.

Please visit the VATEX Captioning Challenge 2020 website for more details!

YouMakeup VQA Challenge

The YouMakeup VQA Challenge aims to provide a common benchmark for fine-grained action understanding in domain-specific videos, e.g., makeup instructional videos. Makeup instructional videos are naturally more fine-grained than open-domain videos: different action steps contain subtle but critical differences in actions, tools, and applied facial areas.

We propose two question-answering tasks to evaluate models' fine-grained action understanding abilities. The first task is Facial Image Ordering, which aims to understand the visual effects on facial objects of different actions expressed in natural language. The second task is Step Ordering, which aims to measure cross-modal semantic alignment between untrimmed long videos and multi-sentence texts.

Please visit the YouMakeup VQA Challenge website for more details!

Organizers and PC


  • Qi Wu, University of Adelaide
  • Xin Wang, UC Santa Cruz
  • Chenxi Liu, Johns Hopkins University
  • Licheng Yu, Facebook AI
  • Lu Jiang, Google AI
  • Yan Huang, UCAS, China
  • Ting Yao, JD AI Research
  • Qin Jin, Renmin University of China
  • William Wang, UC Santa Barbara
  • Anton van den Hengel, University of Adelaide

    Contact the Organizing Committee: