- Alan Yuille, Johns Hopkins University
- Tao Mei, JD Research
- Alan W Black, CMU
- Christoph Feichtenhofer, Facebook AI Research (FAIR)
|8:30-8:40||Opening remarks||Workshop Organizers|
|10:10-10:30||Coffee Break and Poster Session||
|10:50-11:05||Challenge Talk [runner-up]||
|11:05-11:20||Challenge Talk [winner]||
|15:00-15:20||Coffee Break and Poster Session||
|15:30-15:35||Challenge Talk [runner-up]||
|15:35-15:40||Challenge Talk [winner]||
Overview and Call For Papers
Vision and language is an emerging research area that has received much attention. Initial research and applications in this area were mainly image-focused, such as image captioning, visual question answering, and referring expression comprehension. However, moving beyond static images is essential for vision-and-language understanding, as videos contain much richer information, such as spatio-temporal dynamics and audio signals. Most recently, researchers in both the computer vision and natural language processing communities have been striving to bridge videos and natural language. Popular topics such as video captioning, video question answering, and text-guided video generation fall into this area. We are proposing the first Language & Vision with applications to Video Understanding workshop at CVPR, with a joint VATEX Video Captioning Challenge and a YouMakeup Video Question Answering Challenge. This workshop aims to gather researchers from multiple domains to form a new video-language community and to attract more people to this topic. In the workshop, we will invite several top-tier researchers from this area to present their most recent work. We will cover a range of video-language topics, such as video captioning and video question answering. The invited speakers will present the key architectural building blocks and novel algorithms used to solve these tasks.
This workshop covers (but is not limited to) the following topics:
- Video captioning, dialogue, and question-answering;
- Sequence learning towards bridging video and language;
- Novel tasks which combine language and video;
- Understanding the relationship between language and video in humans;
- Video synthesis from language;
- Stories as means of abstraction;
- Transfer learning across language and video;
- Joint video and language alignment and parsing;
- Cross-modal learning beyond image understanding, such as video and audio;
- Multidisciplinary study that may involve linguistics, cognitive science, robotics, etc.
In addition, we will call for 10-15 high-quality 4-page extended abstracts to be showcased at a poster session along with short spotlight talks. Abstracts are not archival and will not be included in the Proceedings of CVPR 2020. In the interest of fostering a freer exchange of ideas, we welcome both novel and previously published work.
This track follows the CVPR paper format. Submissions may consist of up to 4 pages of content (excluding references) in CVPR format, plus unlimited references. We also accept full-length submissions, which will not be included in the Proceedings of CVPR 2020; at the authors' option, we will provide a link to the relevant arXiv submission. The submission should be emailed as a single PDF to firstname.lastname@example.org
The format of papers submitted to the archival track must follow the CVPR Author Guidelines. Style sheets (LaTeX, Word) are available here.
- Submission Deadline: May 6, 2020 (11:59pm Anywhere on Earth time, UTC-12)
- Notification: May 12, 2020
- Workshop Day: June 19, 2020
VATEX Captioning Challenge 2020
The VATEX Captioning Challenge 2020 aims to benchmark progress toward models that can describe videos in multiple languages, such as English and Chinese. This year, in addition to the original 34,991 videos, we release a private test set of 6,278 new videos for evaluation.
Please visit VATEX Captioning Challenge 2020 website for more details!
YouMakeup VQA Challenge
The YouMakeup VQA challenge aims to provide a common benchmark for fine-grained action understanding in domain-specific videos, e.g., makeup instructional videos. Makeup instructional videos are naturally more fine-grained than open-domain videos: different action steps contain subtle but critical differences in actions, tools, and applied facial areas.
We propose two question-answering tasks to evaluate models' fine-grained action understanding. The first task, Facial Image Ordering, aims to assess understanding of the visual effects that different actions, expressed in natural language, have on facial objects. The second task, Step Ordering, aims to measure cross-modal semantic alignment between untrimmed long videos and multi-sentence texts.
Please visit YouMakeup VQA Challenge website for more details!
Organizers and PC
|University of Adelaide|email@example.com|
|UC Santa Barbara|firstname.lastname@example.org|
|Johns Hopkins University|email@example.com|
|JD AI Research|firstname.lastname@example.org|
|Renmin University of China|email@example.com|
|UC Santa Barbara|firstname.lastname@example.org|
|University of Adelaide|email@example.com|
|University of Oxford|firstname.lastname@example.org|
|University of Oxford|email@example.com|
Contact the Organizing Committee: firstname.lastname@example.org