The 2nd Workshop on Computer Vision in the Wild
@ CVPR 2023, June 19


State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concepts.

Recent works show that learning from large-scale image-text data is a promising approach to building transferable visual models that adapt effortlessly to a wide range of downstream computer vision (CV) and multimodal (MM) tasks. Examples include CLIP, ALIGN, and Florence for image classification; ViLD, RegionCLIP, GLIP, and OWL-ViT for object detection; and GroupViT, OpenSeg, MaskCLIP, and X-Decoder for segmentation. These vision models with a language interface are naturally open-vocabulary recognition models, showing superior zero-shot and few-shot adaptation performance in a variety of real-world scenarios.
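At inference time, the zero-shot recognition these models enable amounts to nearest-neighbor matching in a shared image-text embedding space: embed the image, embed a text prompt per candidate label, and pick the most similar prompt. The sketch below illustrates just that matching step with toy vectors standing in for real encoder outputs; no actual CLIP-style model is loaded.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def zero_shot_classify(image_emb, prompts):
    """Return the text prompt whose embedding is most similar to the image embedding."""
    return max(prompts, key=lambda p: cosine(image_emb, prompts[p]))

# Toy embeddings standing in for real image/text encoder outputs.
prompts = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.0, 1.0, 0.3],
}
image_emb = [0.9, 0.1, 0.2]
print(zero_shot_classify(image_emb, prompts))  # prints "a photo of a cat"
```

Because the label set is just a list of text prompts, extending the "vocabulary" means adding a prompt, not retraining — which is what makes these models open-vocabulary.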

We host this "Computer Vision in the Wild" (CVinW) workshop to gather the academic and industry communities working on CV and MM problems in real-world scenarios, focusing on the challenges of open-set/domain visual recognition at different granularities and efficient task-level transfer. To measure the progress of CVinW, we develop new benchmarks for image classification, object detection, and segmentation that evaluate the task-level transfer ability of various models/methods over diverse real-world datasets, in terms of both prediction accuracy and adaptation efficiency. This workshop is a continuation of our ECCV 2022 CVinW Workshop. If you are new to this topic, please check out the CVinW Reading List.


Important Dates

Feb 2023: Competition starts; testing phase begins
April 21, 2023: Workshop paper submission deadline
May 19, 2023: Workshop paper acceptance decisions sent to authors
June 2, 2023: Competition ends (challenge paper submission)
June 2, 2023: Camera-ready submission deadline

Invited Speakers/Panelists

Kristen Grauman
University of Texas at Austin

Boqing Gong

Justin Johnson
University of Michigan | FAIR

Yinfei Yang

Bryan A. Plummer
Boston University

Ziwei Liu

Jacob Solawetz

Anelia Angelova
Google Brain

Jiasen Lu
Allen Institute for AI

Katerina Fragkiadaki

Dhruv Batra
Georgia Tech | FAIR

Call for Papers

Topics of interest include but are not limited to:

  • Open-set visual recognition methods, including classification, object detection, segmentation in images and videos
  • Zero/Few-shot text-to-image generation/editing; Open-domain visual QA & image captioning
  • Unified neural networks architectures and training objectives over different CV & MM tasks
  • Large-scale pre-training, with images/videos only, image/video-text pairs, and external knowledge
  • Efficient adaptation methods for large visual models, measured by the number of training samples (zero- and few-shot), the number of trainable parameters, throughput, and training cost
  • New metrics / benchmarks / datasets to evaluate task-level transfer and open-set visual recognition
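One way to make the adaptation-efficiency axis above concrete is to compare the fraction of parameters each adaptation method actually trains. The sketch below uses made-up placeholder counts (not from any real model) purely to illustrate why methods like linear probing and prompt tuning are considered efficient.

```python
# Hypothetical parameter budgets for adapting one large vision backbone.
BACKBONE_PARAMS = 300_000_000  # illustrative backbone size, not a real model

trainable = {
    "full fine-tuning": BACKBONE_PARAMS,  # update every weight
    "linear probe": 1_000 * 512,          # only a classification head
    "prompt tuning": 16 * 768,            # a few learned prompt vectors
}

for method, n in trainable.items():
    pct = 100.0 * n / BACKBONE_PARAMS
    print(f"{method:16s} {n:>11,d} trainable parameters ({pct:.4f}%)")
```

Under these placeholder numbers, prompt tuning updates well under 0.01% of the backbone's parameters, which is why trainable-parameter count is a useful efficiency metric alongside sample count and training cost.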

  • We accept abstract submissions to our workshop. Each submission may have at most 8 pages (excluding references) and must follow the CVPR 2023 author guidelines. All submissions will be reviewed by the Program Committee on the basis of technical quality, relevance to the scope of the workshop, originality, significance, and clarity. The review process is double-blind, and accepted papers are NOT archived in the CVPR proceedings.

    Workshop Paper Submission Portal: [CMT]

Computer Vision in the Wild Challenges

Two new challenges are introduced this year:

  • "Segmentation in the Wild" (SGinW): 25 image segmentation datasets, evaluated in zero-, few-, and full-shot settings
  • A new object detection challenge spanning 100 object detection datasets, evaluated in zero-, few-, and full-shot settings
The two existing challenges associated with this workshop are "Image Classification in the Wild" (ICinW) and "Object Detection in the Wild" (ODinW). Their evaluation datasets and metrics:

  • ICinW: 20 image classification datasets, evaluated in zero-, few-, and full-shot settings
  • ODinW: 35 object detection datasets, evaluated in zero-, few-, and full-shot settings
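Assuming, as is common for multi-dataset benchmarks of this kind, that the headline number in each setting is the mean of the per-dataset metric (e.g. accuracy for classification, mAP for detection), scoring reduces to a simple average. The dataset names and scores below are invented for illustration.

```python
# Invented per-dataset scores; a multi-dataset benchmark typically reports one
# metric per dataset and summarizes a submission by averaging across datasets.
scores = {"dataset_a": 71.2, "dataset_b": 48.5, "dataset_c": 63.0}

overall = sum(scores.values()) / len(scores)
print(f"overall: {overall:.2f}")  # mean of the per-dataset scores
```

Averaging across many diverse datasets, rather than reporting a single-dataset number, is what lets the benchmark measure transfer ability "in the wild".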
To prevent a race purely in pre-training data and model size, each challenge has two tracks.
  • For the academic track, pre-training data is limited: (1) ICinW: ImageNet21K (Removing ImageNet1K), CC3M+CC12M, YFCC15M; (2) ODinW: Objects365; (3) SGinW: COCO, RefCOCO-g.
  • For the industry track, there is no limitation on pre-training data or model size. Teams are required to disclose meta information about their model and data if extra data is used. Some publicly available image-text datasets: (1) the FLAVA Public Multimodal Datasets (PMD) corpus with 70M pairs; (2) LAION with 400M or 5B pairs.

  • Please see the submission pages for the detailed requirements of each Challenge -> Track -> Phase. More information about the challenge benchmark has been released: [Benchmark] [Document] [Data Download]. Please reach out if you have any issues with your submission.

Workshop Organizers

Jianwei Yang

Haotian Zhang

Haotian Liu
UW Madison

Xiuye Gu

Chunyuan Li

Neil Houlsby

Jianfeng Gao

Challenge Organizers (TBD)

Xueyan Zou
UW Madison

Francesco Zuppichini

Workshop and Challenge Questions?
Reach out:
Workshop Organizing Team