Task Description


  • Image Ordering Sub-Challenge
  • The task is to sort a set of facial images from a video into the correct order according to the given step descriptions. The goal is to understand the changes that an action described in natural language causes to a face. The effect of an action description on facial appearance can vary greatly, depending not only on the text description but also on the previous state of the face. Some actions produce obvious facial changes, such as "apply red lipsticks on the lips", while others cause only slight differences, such as "apply foundation on the face with brush", which are easier to detect if the previous appearance is known. Therefore, fine-grained multimodal analysis of visual faces and textual actions is necessary to tackle this task.

  • Step Ordering Sub-Challenge
  • The task is to sort a set of action descriptions into the order in which the actions are performed in the video. It aims to evaluate a model's ability to perform cross-modal semantic alignment between visual content and text. Compared with previous video-text cross-modal localization tasks, this task is novel in three respects. First, different actions share similar background contexts, so the model must focus specifically on actions and action-related objects rather than on correlated but irrelevant context. Second, since different actions can be very similar in visual appearance, the task demands fine-grained discrimination. Finally, the task goes beyond localizing a single text in a single video and requires long-term temporal action reasoning and textual understanding.


    Challenge Guidelines


  • Dataset Download
  • Please refer to the details at the Dataset page.

  • Submission
  • The challenge is hosted on CodaLab. Please go to the challenge page to submit your results.

  • Evaluation Metrics
  • Both tasks are evaluated by the accuracy of multiple-choice selection.
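
As an illustration of the metric, the following is a minimal sketch of how multiple-choice accuracy could be computed, assuming each question has exactly one correct option and that predictions and ground-truth answers are given as option indices. The function name `multi_choice_accuracy` and the example values are hypothetical; this is not the official evaluation script, and the submission format is defined on the challenge page.

```python
from typing import Sequence


def multi_choice_accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Return the fraction of questions whose predicted option index
    matches the correct option index (hypothetical helper)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count the questions where the chosen option equals the correct one.
    correct = sum(int(p == a) for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical example: four questions, options indexed from 0.
predicted = [2, 0, 1, 3]
ground_truth = [2, 1, 1, 3]
accuracy = multi_choice_accuracy(predicted, ground_truth)
print(f"accuracy = {accuracy:.2f}")  # three of four correct
```

Under this assumption an ordering question counts as correct only when the selected option is exactly the right order; no partial credit is given for partially correct orderings.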

  • Requirements
  • 1. Participants must adhere to the defined training, validation and test partitions to ensure a fair comparison of different approaches.

    2. The Challenge is a team-based contest. Each team can have one or more members, and an individual cannot be a member of multiple teams.


    3. Each team can submit at most two trials per day for each sub-challenge on the test partition.

    4. At the end of the Challenge, all teams will be ranked based on the evaluation described above. The top teams will receive award certificates.



  • Baseline Paper
  • The paper introducing the YouMakeup VQA Challenge baseline can be viewed here.

    The baseline code and models are released here.


    If you find it useful, please cite our baseline paper as follows.
    
    @misc{chen2020youmakeup,
        title={YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos},
        author={Shizhe Chen and Weiying Wang and Ludan Ruan and Linli Yao and Qin Jin},
        year={2020},
        eprint={2004.05573},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
    }