Workshop on Computer Vision in the Wild
@ ECCV 2022, October 23 Virtual Meeting (Video List on YouTube and Bilibili)
9:00am-6:00pm Israel Daylight Time (IDT) || 11:00pm (October 22)-8:00am Pacific Time || 2:00pm-11:00pm Beijing Time


State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept.

Recent works show that learning from large-scale image-text data is a promising approach to building transferable visual models that can effortlessly adapt to a wide range of downstream computer vision (CV) and multimodal (MM) tasks: for example, CLIP, ALIGN and Florence for image classification, and ViLD, RegionCLIP and GLIP for object detection. These vision models with a language interface are naturally open-vocabulary recognition models, showing superior zero-shot and few-shot adaptation performance in various real-world scenarios.
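For readers new to the area, the core mechanism behind such zero-shot recognition can be illustrated with a small sketch (not any specific model's implementation): the image embedding is compared against one text embedding per candidate class, e.g. prompts like "a photo of a cat", and a softmax over cosine similarities yields class probabilities. The embeddings below are toy values chosen for illustration.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Score one image embedding against one text embedding per class.

    Embeddings are L2-normalized so the dot product equals cosine
    similarity, as in CLIP-style contrastive models; a temperature-scaled
    softmax over classes then gives zero-shot probabilities.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature        # one logit per class prompt
    exp = np.exp(logits - logits.max())     # numerically stable softmax
    return exp / exp.sum()

# Toy 4-d embeddings standing in for encoder outputs.
image_emb = np.array([1.0, 0.1, 0.0, 0.0])
text_embs = np.array([[0.9, 0.2, 0.0, 0.1],   # e.g. "a photo of a cat"
                      [0.0, 0.1, 1.0, 0.2]])  # e.g. "a photo of a dog"
probs = zero_shot_classify(image_emb, text_embs)
print(probs.argmax())  # index of the best-matching class prompt
```

Because the class set is just a list of text prompts, swapping in new categories requires no retraining, which is what makes these models "open-vocabulary".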

We propose this "Computer Vision in the Wild (CVinW)" workshop, aiming to gather the academic and industry communities to work on CV problems in real-world scenarios, focusing on the challenges of open-set/domain visual recognition and efficient task-level transfer. Since there are no established benchmarks to measure the progress of "CV in the Wild", we develop new benchmarks for image classification and object detection to measure the task-level transfer ability of various models/methods over diverse real-world datasets, in terms of both prediction accuracy and adaptation efficiency. This workshop will also host two challenges based on the ELEVATER benchmarks.

For those who are new to this topic, please check out the CVinW Reading List.

The 2nd CVinW Workshop & Challenge at CVPR 2023
Related Call for Papers (Deadline: April 1, 2023): International Journal of Computer Vision (IJCV) special issue on "Promises and Dangers of Large Vision Models"

Invited Speakers

Kate Saenko
Boston University

Stella Yu
University of Michigan

Ishan Misra

Yin Cui

Yongqin Xian

Xiaolong Wang
UC San Diego

Matthias Minderer

Xiaohua Zhai

Neil Houlsby

Program (Video List on YouTube and Bilibili)

9 AM - 9:15 AM IDT

(11 PM - 11:15 PM PT - 10/22)
Welcome [YouTube | Bilibili]
Jianfeng Gao - Microsoft Research
9:15 AM - 9:45 AM IDT

(11:15 PM - 11:45 PM PT - 10/22)
Invited Talk [YouTube | Bilibili]
Xiaolong Wang (UC San Diego)

Title: Robot Perception in the Wild
Abstract
Visual representation learning has achieved tremendous success in semantic and geometric understanding. But how can this knowledge help robots interact with the physical world and operate in the wild? In this talk, I will introduce our studies on learning rich and generalizable visual representations for such a purpose. Specifically, I will talk about our work on learning open vocabulary semantic representations with only text supervision, and 3D object representations from videos in the wild using self-supervision. At the end of the talk, I will briefly show demos of how these techniques can help improve real-world robotics tasks including dexterous hand manipulation and legged robot locomotion control.
9:45 AM - 10:15 AM IDT

(11:45 PM - 12:15 AM PT - 10/22)
Invited Talk: Google Brain Zurich Team Tutorial (1/3) [YouTube | Bilibili]
Neil Houlsby (Google)

Title: Architectures Beyond CNNs and Visual Scaling Laws
Abstract
I will present some of our work that has explored the capabilities of non-convolutional architectures for Computer Vision, such as Transformers, Mixers, and Mixture-of-experts based models. These architectures often demonstrate favourable properties in the context of transfer learning from a large source dataset to small target datasets. In this context, I will discuss our exploration into these models' scaling laws, improved scaling law estimators, and the apparent saturation of larger vision models.
10:15 AM - 10:45 AM IDT

(12:15 AM - 12:45 AM PT)
Invited Talk: Google Brain Zurich Team Tutorial (2/3) [YouTube | Bilibili]
Xiaohua Zhai (Google)

Title: Scaling Vision and Language Learning with Vision Transformers
Abstract
Attention-based neural networks such as the Vision Transformer (ViT) have recently achieved state-of-the-art results on many computer vision benchmarks. Scale is a primary ingredient in obtaining excellent results. In this talk, I will first present how to train a text model to "read out" good representations from a pre-trained and locked Vision Transformer model for new tasks, named "Locked-image Tuning" (LiT). A LiT model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval. Then I will discuss how to use the mixture-of-experts model (LIMoE) for contrastive learning, which accepts both images and text simultaneously. Finally, I will share how to reuse pre-trained ViT models and pre-trained encoder-decoder language models in PaLI, to support vision and language downstream tasks such as captioning and VQA across over 100 languages. When scaling up the PaLI visual component from ViT-G to ViT-e, we observed a significant boost on vision and language tasks.
10:45 AM - 11:15 AM IDT

(12:45 AM - 1:15 AM PT)
Invited Talk: Google Brain Zurich Team Tutorial (3/3) [YouTube | Bilibili]
Matthias Minderer (Google)

Title: Beyond image-level tasks: Scalable approaches to localization and dense prediction
Abstract
Scaling of vision-language models has been very successful for image-level tasks such as classification and retrieval. I will present work on extending this approach beyond the image level, to structured tasks such as detection, segmentation, depth estimation, and colorization. I will first discuss a simple architecture and recipe for transferring vision-language models to open-vocabulary object detection (OWL-ViT). Second, I will discuss a general image modeling approach which combines a language and an image model to unify many dense prediction tasks (UViM). In this approach, the language model learns to represent structured, interdependent data features, while the image model efficiently deals with high-dimensional pixel-wise output. These methods are competitive with the respective state-of-the-art methods, while being simpler and more general.
11:15 AM - 12 PM IDT

(1:15 AM - 2 AM PT)
Challenge Summary on ICinW and ODinW [YouTube | Bilibili]
Chunyuan Li (Microsoft Research)

Winner team presentations (Decided by October 20, 2022) [YouTube | Bilibili]
- ICinW Industry Track | Chinese CLIP | Junyang Lin (Alibaba)
- ICinW Academic Track | K-LITE | Sheng Shen (University of California, Berkeley)
- ICinW ImageNet-1K in Pre-training | Bamboo | Yuanhan Zhang (Nanyang Technological University)
- ICinW Parameter-Efficiency | ProDA | Yuning Lu (University of Science and Technology of China)
- ODinW Zero-Shot Track | DetCLIP | Jianhua Han (Huawei)
- ODinW Full-Shot Track | DINO | Shilong Liu (IDEA & Tsinghua), Hao Zhang (IDEA & HKUST)
12 PM - 1:30 PM IDT

(2 AM - 3:30 AM PT)

Lunch Break

1:30 PM - 2 PM IDT

(3:30 AM - 4 AM PT)
Invited Talk [YouTube | Bilibili]
Title: General purpose visual recognition systems: beyond a single modality and a task
Ishan Misra (Meta)

Abstract
Modern computer vision models are good at specialized tasks. Given the right architecture and the right supervision, supervised learning can yield great specialist models. However, specialist models also have severe limitations: they can only do what they are trained for and require copious amounts of pristine supervision for it. In this talk, I'll focus on two limitations: specialist models cannot work on tasks beyond what they saw training labels for, or on new types of visual data. I'll present our recent efforts that design better architectures, training paradigms and loss functions to address these issues. Our first line of work, called Omnivore, presents a single model that can operate on images, videos, and single-view 3D data. Omnivore leads to shared representations across visual modalities, without using paired input data. Omnivore can also be trained in a self-supervised manner. I'll conclude the talk with general purpose detection and segmentation models. We developed Detic, a simple way to train large-vocabulary detectors using image-level labels, which leads to a 20,000+ class detector. We also proposed Mask2Former, which is a single meta architecture for all types of image and video segmentation tasks.
2 PM - 2:30 PM IDT

(4 AM - 4:30 AM PT)
Invited Talk [YouTube | Bilibili]
Title: Learning Unsupervised Semantic Embeddings for Zero-Shot Image Classification
Yongqin Xian (Google)

Abstract
Despite the tremendous progress in zero-shot learning (ZSL), the majority of existing methods still rely on human-annotated attributes, which are difficult to annotate and scale. An unsupervised alternative is to represent each class using the word embedding associated with its semantic class name. However, word embeddings extracted from pre-trained language models do not necessarily capture visual similarities, resulting in poor zero-shot performance. In this talk, I will first present visually-grounded semantic embedding (VGSE), which enhances the word embeddings by mapping them into a latent space learned by clustering image regions. In the second part, I will introduce the Image to Document Transformer (I2DFormer), a new transformer-based ZSL framework that learns unsupervised semantic embeddings from images and class-level online textual documents, e.g., Wikipedia. We empirically show that both methods significantly outperform previous unsupervised semantic embeddings on three public datasets and lead to highly interpretable results.
2:30 PM - 3:45 PM IDT

(4:30 AM - 5:45 AM PT)
Spotlight Paper Presentations [YouTube | Bilibili]
- MS-CLIP | Haoxuan You (Columbia University)
- CLIP understands Texture? | Chenyun Wu (UMass Amherst)
- Synthetic Data for Infrequent OD | Ninad Kulkarni (Amazon Web Services)
- XMem | Ho Kei Cheng (University of Illinois Urbana-Champaign)
- AUGCO | Viraj Prabhu (Georgia Tech)
- Unsupervised Selective Labeling | Long Lian (UC Berkeley)
- Diffusion Models for Outfit Rendering | Vignesh Srinivasan (Zalando Research)
- OmDet | Tiancheng Zhao (Binjiang Institute of Zhejiang University)

3:45 PM - 4:30 PM IDT

(5:45 AM - 6:30 AM PT)

Afternoon Break

4:30 PM - 5 PM IDT

(6:30 AM - 7 AM PT)
Invited Talk [YouTube | Bilibili]
Kate Saenko (Boston University)

Title: Data Shift Happens, What To Do About It?
Abstract
In computer vision, generalization of learned representations is usually measured on i.i.d. data. This hides the fact that models often struggle to generalize to non-i.i.d data and fail to overcome the biases inherent in visual datasets. Labeling additional data in each new situation is the standard solution but is often prohibitively expensive. I will discuss some recent work in my lab addressing the core challenges in overcoming dataset bias, including adaptation to natural domain shifts, sim2real transfer, avoiding spurious correlations, and the role of pretraining in generalizability.
5 PM - 5:30 PM IDT

(7 AM - 7:30 AM PT)
Invited Talk [YouTube | Bilibili]
Yin Cui (Google)

Title: Open-Vocabulary Visual Perception upon Frozen Vision and Language Models
Abstract [Slides]
Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs has become a promising paradigm for open-vocabulary visual perception. In our recent explorations, we developed open-vocabulary models for detection based on distilling VLMs on existing detection data (ViLD), and for segmentation based on aligning image regions with image captions (OpenSeg). In this talk, I will focus on how to greatly simplify the paradigm by directly building upon frozen VLMs like CLIP with minimal modifications. In the first part, I will present our open-vocabulary detection model F-VLM, which achieves state-of-the-art performance on the LVIS benchmark by only training a light-weight detector head. In the second part, I will show how we leverage motion and audio to help video models generalize better to novel classes. Our model MOV encodes video, audio and flow with the same pre-trained CLIP vision encoder (frozen for video). We design an asymmetrical cross-attention module to aggregate multimodal information. MOV achieves state-of-the-art performance on UCF and HMDB, outperforming both traditional zero-shot methods and recent CLIP-based adaptation methods.
5:30 PM - 6 PM IDT

(7:30 AM - 8 AM PT)
Invited Talk [YouTube | Bilibili]
Stella Yu (University of Michigan)

Title: Learning Mid-Level Vision from Nothing but Data
Abstract
Computer vision with deep learning has achieved super-human performance on various benchmarks. However, deep neural network models are highly specialized for the task and the data they are trained on. In contrast, babies with normal vision eventually all learn to see from their widely different visual experiences. I attribute this fascinating development of universal visual perception to the ability of learning mid-level visual representations from data without any external supervision. I will present our recent work on unsupervised learning of visual recognition from unlabeled videos and images, demonstrating that structures in the visual data can be discovered from nothing but data with minimal priors and model bottlenecks.

Accepted Papers

Computer Vision in the Wild Challenges

    There are two challenges associated with this workshop: "Image Classification in the Wild" (ICinW) and "Object Detection in the Wild" (ODinW). We summarize their evaluation datasets and metrics in the table below.

    Challenge    Eval Datasets                        Eval Metrics
    ICinW        20 image classification datasets     Zero-, few-, and full-shot
    ODinW        35 object detection datasets         Zero-, few-, and full-shot

    Submissions for each challenge are made via the corresponding "Make a Challenge Submission" page.
    To prevent a pure race in pre-training data and model size, we will have two tracks.
  • For the academic track, pre-training data is limited to: (1) ICinW: ImageNet-21K (excluding ImageNet-1K), CC3M+CC12M, and YFCC15M; (2) ODinW: Objects365.
  • For the industry track, there is no limitation on pre-training data or model size. Teams are required to disclose meta information about the model and data if extra data is used. Some publicly available image-text datasets: (1) the FLAVA Public Multimodal Datasets (PMD) corpus with 70M pairs; (2) LAION with 400M or 5B pairs.

  • Please see the submission pages for the detailed requirements of each Challenge -> Track -> Phase. More information about the challenge benchmark is available: [Benchmark] [Document] [Data Download]. Please reach out if you have any issues with your submissions.

    Please add a link to your team so that others can associate the submissions on the leaderboard with the corresponding team and work.
    • Click "Participation Teams" on the left side-bar
    • Find your participation team for the submission and click edit (the pencil icon)
    • Update "Team URL (Optional)" with your paper ArXiv link, GitHub page, etc.


July 25, 2022 Competition starts, testing phase begins
October 14, 2022 Competition ends (challenge paper submission)
October 14, 2022 (Extended) Workshop paper submission deadline
October 17, 2022 Workshop paper acceptance decision to authors
October 20, 2022 Camera-ready submission deadline

Workshop Organizers

Chunyuan Li

Jyoti Aneja

Jianwei Yang

Xin Wang

Pengchuan Zhang
Meta AI

Haotian Liu
UW Madison

Haotian Zhang
University of Washington

Liunian Li

Aishwarya Kamath

Challenge Organizers

Yinfei Yang

Yi-Ting Chen

Ye Xia

Yangguang Li

Feng Liang
UT Austin

Yufeng Cui

Ping Jin

Shohei Ono

Houwen Peng

Saining Xie

Han Hu

Amanpreet Singh

Xiaojie Jin

Jiashi Feng

Junyang Lin

An Yang

Peng Wang

Nguyen Bach

Yuning Lu

Yuanhan Zhang

Kaiyang Zhou

Ziwei Liu

Shilong Liu
Tsinghua University

Feng Li

Hao Zhang

Jianfeng Wang

Lijuan Wang

Xuehai He

Xin Eric Wang

Changyou Chen
University at Buffalo, SUNY

Yi Xu

Haoxuan You
Columbia University

Advisory Committee

Trevor Darrell
UC Berkeley

Lei Zhang

Yong Jae Lee
UW Madison

Houdong Hu

Zicheng Liu

Ce Liu

Xuedong Huang

Kai-Wei Chang

Jingdong Wang

Zhuowen Tu

Jianfeng Gao

Jenq-Neng Hwang
University of Washington

Yann LeCun

Disclaimer: To ensure fair comparisons in the challenge, the evaluation server and leaderboards are independently developed and maintained by the Workshop Organizers, while the Challenge Organizers actively promote and contribute to the competitions.

Workshop and Challenge Questions?
Reach out:
Workshop Organizing Team