Invited Speakers


8:30-8:40 Opening Remarks (Workshop Organizers)
8:40-9:10 Invited Talk
9:10-9:40 Invited Talk
9:40-10:10 Invited Talk
10:10-10:30 Coffee Break and Poster Session
10:30-10:50 VATEX Challenge
10:50-11:05 Challenge Talk [runner up]
11:05-11:20 Challenge Talk [winner]
11:20-11:50 Invited Talk
11:50-13:30 Lunch
13:30-14:00 Invited Talk
14:00-14:30 Invited Talk
14:30-15:00 Invited Talk
15:00-15:20 Coffee Break and Poster Session
15:20-15:30 YouMakeup Challenge
15:30-15:35 Challenge Talk [runner up]
15:35-15:40 Challenge Talk [winner]
15:40-16:20 Poster Highlights
16:20-16:50 Invited Talk
16:50-17:30 Panel Discussion

Overview and Call For Papers

Vision and language is a recently emerged research area that has received a lot of attention. Initial research and applications in this area were mainly image-focused, such as Image Captioning, Visual Question Answering, and Referring Expression. However, moving beyond static images is essential for vision and language understanding, as videos contain much richer information, such as spatio-temporal dynamics and audio signals. Most recently, researchers in both the computer vision and natural language processing communities have been striving to bridge videos and natural language. Popular topics such as video captioning, video question answering, and text-guided video generation fall into this area. We are proposing the first workshop on Language & Vision with applications to Video Understanding at CVPR, held jointly with a VATEX Video Captioning Challenge and a YouMakeup Video Question Answering Challenge. The workshop aims to gather researchers from multiple domains to form a new video-language community and to attract more people to this topic. We will invite several top-tier researchers in this area to present their most recent work, covering video-language topics such as video captioning and video question answering. The invited speakers will present key architectural building blocks and novel algorithms used to solve these tasks.

This workshop covers (but is not limited to) the following topics:

In addition, we call for 10-15 high-quality 4-page extended abstracts to be showcased at a poster session along with short spotlight talks. Abstracts are not archival and will not be included in the Proceedings of CVPR 2020. In the interest of fostering a freer exchange of ideas, we welcome both novel and previously published work.

Submission details

This track follows the CVPR paper format. Submissions may consist of up to 4 pages of content in CVPR format, plus unlimited references. We are also accepting full submissions, which will not be included in the Proceedings of CVPR 2020; at the option of the authors, we will provide a link to the relevant arXiv submission. The submission should be emailed as a single PDF to the

Papers submitted to the archival track must follow the CVPR Author Guidelines. Style sheets (LaTeX, Word) are available here.

Important Dates


VATEX Captioning Challenge 2020

The VATEX Captioning Challenge 2020 aims to benchmark progress towards models that can describe videos in multiple languages, such as English and Chinese. This year, in addition to the original 34,991 videos, we release a private test set with 6,278 new videos for evaluation.

Please visit the VATEX Captioning Challenge 2020 website for more details!

YouMakeup VQA Challenge

The YouMakeup VQA Challenge aims to provide a common benchmark for fine-grained action understanding in domain-specific videos, e.g., makeup instructional videos. Makeup instructional videos are naturally more fine-grained than open-domain videos: different action steps contain subtle but critical differences in actions, tools, and applied facial areas.

We propose two question-answering tasks to evaluate models' fine-grained action understanding abilities. The first task is Facial Image Ordering, which aims to understand the visual effects that different actions, expressed in natural language, have on facial objects. The second task is Step Ordering, which aims to measure cross-modal semantic alignment between untrimmed long videos and multi-sentence texts.
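As a rough illustration only, both ordering tasks could be scored by checking whether a predicted order matches a reference order exactly. The sketch below assumes that scoring rule and a hypothetical function name, `ordering_accuracy`; the official metric and data format are defined on the challenge website and may differ.

```python
from typing import Sequence


def ordering_accuracy(
    predictions: Sequence[Sequence[int]],
    references: Sequence[Sequence[int]],
) -> float:
    """Fraction of questions whose predicted order matches the reference exactly.

    Assumption: exact-match accuracy is used here for illustration; it is not
    confirmed to be the official challenge metric.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        1 for pred, ref in zip(predictions, references) if list(pred) == list(ref)
    )
    return correct / len(references)


# Hypothetical data: three questions, each asking for an order of four items
# (facial images in one task, action steps in the other).
preds = [[2, 0, 1, 3], [0, 1, 2, 3], [3, 2, 1, 0]]
refs = [[2, 0, 1, 3], [0, 2, 1, 3], [3, 2, 1, 0]]
score = ordering_accuracy(preds, refs)
print(f"accuracy = {score:.3f}")
```

Exact match is strict: an order with a single swapped pair counts as wrong, which is one reason a partial-credit measure might be preferred in practice.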

Please visit YouMakeup VQA Challenge website for more details!

Organizers and PC


  • Qi Wu, University of Adelaide
  • Xin Wang, UC Santa Barbara
  • Chenxi Liu, Johns Hopkins University
  • Licheng Yu, Microsoft
  • Lu Jiang, Google AI
  • Yan Huang, UCAS, China
  • Ting Yao, JD AI Research
  • Qin Jin, Renmin University of China
  • William Wang, UC Santa Barbara
  • Anton van den Hengel, University of Adelaide
  • Andrei Barbu, MIT
  • Siddharth N, University of Oxford
  • Dan Gutfreund, IBM
  • Philip Torr, University of Oxford

    Contact the Organizing Committee: