Workshop on Computer Vision in the Wild
@ ECCV 2022, October 23 Virtual Meeting (Video List on YouTube and Bilibili)
9:00am-6:00pm Israel Daylight Time (IDT) || 11:00pm (October 22)-8:00am Pacific Time || 2:00pm-11:00pm Beijing Time


State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept.

Recent works show that learning from large-scale image-text data is a promising approach to building transferable visual models that can effortlessly adapt to a wide range of downstream computer vision (CV) and multimodal (MM) tasks: for example, CLIP, ALIGN and Florence for image classification, and ViLD, RegionCLIP and GLIP for object detection. These vision models with a language interface are naturally open-vocabulary recognition models, showing superior zero-shot and few-shot adaptation performance in various real-world scenarios.
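For readers new to the area, the core mechanism behind such zero-shot recognition can be illustrated with a small sketch (not any specific model's implementation): the image embedding is compared against one text embedding per candidate class, e.g. prompts like "a photo of a cat", and a softmax over cosine similarities yields class probabilities. The embeddings below are toy values chosen for illustration.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Score one image embedding against one text embedding per class.

    Embeddings are L2-normalized so the dot product equals cosine
    similarity, as in CLIP-style contrastive models; a temperature-scaled
    softmax over classes then gives zero-shot probabilities.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature        # one logit per class prompt
    exp = np.exp(logits - logits.max())     # numerically stable softmax
    return exp / exp.sum()

# Toy 4-d embeddings standing in for encoder outputs.
image_emb = np.array([1.0, 0.1, 0.0, 0.0])
text_embs = np.array([[0.9, 0.2, 0.0, 0.1],   # e.g. "a photo of a cat"
                      [0.0, 0.1, 1.0, 0.2]])  # e.g. "a photo of a dog"
probs = zero_shot_classify(image_emb, text_embs)
print(probs.argmax())  # index of the best-matching class prompt
```

Because the class set is just a list of text prompts, swapping in new categories requires no retraining, which is what makes these models "open-vocabulary".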

We propose this "Computer Vision in the Wild (CVinW)" workshop, aiming to gather the academic and industry communities to work on CV problems in real-world scenarios, focusing on the challenges of open-set/domain visual recognition and efficient task-level transfer. Since there are no established benchmarks to measure the progress of "CV in the Wild", we develop new benchmarks for image classification and object detection to measure the task-level transfer ability of various models/methods over diverse real-world datasets, in terms of both prediction accuracy and adaptation efficiency. This workshop will also host two challenges based on the ELEVATER benchmarks.

For those who are new to this topic, please check out the CVinW Reading List.

The 2nd CVinW Workshop & Challenge at CVPR 2023
Related Call for Papers (Deadline: April 1, 2023): International Journal of Computer Vision (IJCV) special issue on "Promises and Dangers of Large Vision Models"

Invited Speakers

Kate Saenko
Boston University

Stella Yu
University of Michigan

Ishan Misra

Yin Cui

Yongqin Xian

Xiaolong Wang
UC San Diego

Matthias Minderer

Xiaohua Zhai

Neil Houlsby

Program (Video List on YouTube and Bilibili)

9 AM - 9:15 AM IDT

(11 PM - 11:15 PM PT - 10/22)
Welcome [YouTube | Bilibili]
Jianfeng Gao - Microsoft Research
9:15 AM - 9:45 AM IDT

(11:15 PM - 11:45 PM PT - 10/22)
Invited Talk [YouTube | Bilibili]
Xiaolong Wang (UC San Diego)

Title: Robot Perception in the Wild
Abstract
Visual representation learning has achieved tremendous success in semantic and geometric understanding. But how can this knowledge help robots interact with the physical world and operate in the wild? In this talk, I will introduce our studies on learning rich and generalizable visual representations for such a purpose. Specifically, I will talk about our work on learning open vocabulary semantic representations with only text supervision, and 3D object representations from videos in the wild using self-supervision. At the end of the talk, I will briefly show demos of how these techniques can help improve real-world robotics tasks including dexterous hand manipulation and legged robot locomotion control.
9:45 AM - 10:15 AM IDT

(11:45 PM - 12:15 AM PT - 10/22)
Invited Talk: Google Brain Zurich Team Tutorial (1/3) [YouTube | Bilibili]
Neil Houlsby (Google)

Title: Architectures Beyond CNNs and Visual Scaling Laws
Abstract
I will present some of our work that has explored the capabilities of non-convolutional architectures for Computer Vision, such as Transformers, Mixers, and Mixture-of-experts based models. These architectures often demonstrate favourable properties in the context of transfer learning from a large source dataset to small target datasets. In this context, I will discuss our exploration into these models' scaling laws, improved scaling law estimators, and the apparent saturation of larger vision models.
10:15 AM - 10:45 AM IDT

(12:15 AM - 12:45 AM PT)
Invited Talk: Google Brain Zurich Team Tutorial (2/3) [YouTube | Bilibili]
Xiaohua Zhai (Google)

Title: Scaling Vision and Language Learning with Vision Transformers
Abstract
Attention-based neural networks such as the Vision Transformer (ViT) have recently achieved state-of-the-art results on many computer vision benchmarks. Scale is a primary ingredient in obtaining excellent results. In this talk, I will first present how to train a text model to "read out" good representations from a pre-trained and locked Vision Transformer model for new tasks, named "Locked-image Tuning" (LiT). A LiT model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval. Then I will discuss how to use the mixture-of-experts model (LIMoE) for contrastive learning, which accepts both images and text simultaneously. Finally, I will share how to reuse pre-trained ViT models and pre-trained encoder-decoder language models in PaLI, to support vision and language downstream tasks such as captioning and VQA across over 100 languages. When scaling up the PaLI visual component from ViT-G to ViT-e, we observed a significant boost on vision and language tasks.
10:45 AM - 11:15 AM IDT

(12:45 AM - 1:15 AM PT)
Invited Talk: Google Brain Zurich Team Tutorial (3/3) [YouTube | Bilibili]
Matthias Minderer (Google)

Title: Beyond image-level tasks: Scalable approaches to localization and dense prediction
Abstract
Scaling of vision-language models has been very successful for image-level tasks such as classification and retrieval. I will present work on extending this approach beyond the image level, to structured tasks such as detection, segmentation, depth estimation, and colorization. I will first discuss a simple architecture and recipe for transferring vision-language models to open-vocabulary object detection (OWL-ViT). Second, I will discuss a general image modeling approach which combines a language and an image model to unify many dense prediction tasks (UViM). In this approach, the language model learns to represent structured, interdependent data features, while the image model efficiently deals with high-dimensional pixel-wise output. These methods are competitive with the respective state-of-the-art methods, while being simpler and more general.
11:15 AM - 12 PM IDT

(1:15 AM - 2 AM PT)
Challenge Summary on ICinW and ODinW [YouTube | Bilibili]
Chunyuan Li (Microsoft Research)

Winner team presentations (Decided by October 20, 2022) [YouTube | Bilibili]
- ICinW Industry Track | Chinese CLIP | Junyang Lin (Alibaba)
- ICinW Academic Track | K-LITE | Sheng Shen (University of California, Berkeley)
- ICinW ImageNet-1K in Pre-training | Bamboo | Yuanhan Zhang (Nanyang Technological University)
- ICinW Parameter-Efficiency | ProDA | Yuning Lu (University of Science and Technology of China)
- ODinW Zero-Shot Track | DetCLIP | Jianhua Han (Huawei)
- ODinW Full-Shot Track | DINO | Shilong Liu (IDEA & Tsinghua), Hao Zhang (IDEA & HKUST)
12 PM - 1:30 PM IDT

(2 AM - 3:30 AM PT)

Lunch Break

1:30 PM - 2 PM IDT

(3:30 AM - 4 AM PT)
Invited Talk [YouTube | Bilibili]
Title: General purpose visual recognition systems: beyond a single modality and a task
Ishan Misra (Meta)

Abstract
Modern computer vision models are good at specialized tasks. Given the right architecture and the right supervision, supervised learning can yield great specialist models. However, specialist models also have severe limitations: they can only do what they are trained for and require copious amounts of pristine supervision for it. In this talk, I'll focus on two limitations: specialist models cannot work on tasks beyond what they saw training labels for, or on new types of visual data. I'll present our recent efforts that design better architectures, training paradigms and loss functions to address these issues. Our first line of work, called Omnivore, presents a single model that can operate on images, videos, and single-view 3D data. Omnivore leads to shared representations across visual modalities, without using paired input data. Omnivore can also be trained in a self-supervised manner. I'll conclude the talk with general purpose detection and segmentation models. We developed Detic, a simple way to train large-vocabulary detectors using image-level labels, which leads to a 20,000+ class detector. We also proposed Mask2Former, which is a single meta architecture for all types of image and video segmentation tasks.
2 PM - 2:30 PM IDT

(4 AM - 4:30 AM PT)
Invited Talk [YouTube | Bilibili]
Title: Learning Unsupervised Semantic Embeddings for Zero-Shot Image Classification
Yongqin Xian (Google)

Abstract
Despite the tremendous progress in zero-shot learning (ZSL), the majority of existing methods still rely on human-annotated attributes, which are difficult to annotate and scale. An unsupervised alternative is to represent each class using the word embedding associated with its semantic class name. However, word embeddings extracted from pre-trained language models do not necessarily capture visual similarities, resulting in poor zero-shot performance. In this talk, I will first present visually-grounded semantic embedding (VGSE), which enhances the word embeddings by mapping them into a latent space learned by clustering image regions. In the second part, I will introduce the Image to Document Transformer (I2DFormer), a new transformer-based ZSL framework that learns unsupervised semantic embeddings from images and class-level online textual documents, e.g., Wikipedia. We empirically show that both methods significantly outperform previous unsupervised semantic embeddings on three public datasets and lead to highly interpretable results.
2:30 PM - 3:45 PM IDT

(4:30 AM - 5:45 AM PT)
Spotlight Paper Presentations [YouTube | Bilibili]
- MS-CLIP | Haoxuan You (Columbia University)
- CLIP understands Texture? | Chenyun Wu (UMass Amherst)
- Synthetic Data for Infrequent OD | Ninad Kulkarni (Amazon Web Services)
- XMem | Ho Kei Cheng (University of Illinois Urbana-Champaign)
- AUGCO | Viraj Prabhu (Georgia Tech)
- Unsupervised Selective Labeling | Long Lian (UC Berkeley)
- Diffusion Models for Outfit Rendering | Vignesh Srinivasan (Zalando Research)
- OmDet | Tiancheng Zhao (Binjiang Institute of Zhejiang University)

3:45 PM - 4:30 PM IDT

(5:45 AM - 6:30 AM PT)

Afternoon Break

4:30 PM - 5 PM IDT

(6:30 AM - 7 AM PT)
Invited Talk [YouTube | Bilibili]
Kate Saenko (Boston University)

Title: Data Shift Happens, What To Do About It?
Abstract
In computer vision, generalization of learned representations is usually measured on i.i.d. data. This hides the fact that models often struggle to generalize to non-i.i.d data and fail to overcome the biases inherent in visual datasets. Labeling additional data in each new situation is the standard solution but is often prohibitively expensive. I will discuss some recent work in my lab addressing the core challenges in overcoming dataset bias, including adaptation to natural domain shifts, sim2real transfer, avoiding spurious correlations, and the role of pretraining in generalizability.
5 PM - 5:30 PM IDT

(7 AM - 7:30 AM PT)
Invited Talk [YouTube | Bilibili]
Yin Cui (Google)

Title: Open-Vocabulary Visual Perception upon Frozen Vision and Language Models
Abstract [Slides]
Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs has become a promising paradigm for open-vocabulary visual perception. In our recent explorations, we developed open-vocabulary models for detection based on distilling VLMs on existing detection data (ViLD), and for segmentation based on aligning image regions with image captions (OpenSeg). In this talk, I will focus on how to greatly simplify the paradigm by directly building upon frozen VLMs like CLIP with minimal modifications. In the first part, I will present our open-vocabulary detection model F-VLM, which achieves state-of-the-art performance on the LVIS benchmark by only training a light-weight detector head. In the second part, I will show how we leverage motion and audio to help video models generalize better to novel classes. Our model MOV encodes video, audio and flow with the same pre-trained CLIP vision encoder (frozen for video). We design an asymmetrical cross-attention module to aggregate multimodal information. MOV achieves state-of-the-art performance on UCF and HMDB, outperforming both traditional zero-shot methods and recent CLIP-based adaptation methods.
5:30 PM - 6 PM IDT

(7:30 AM - 8 AM PT)
Invited Talk [YouTube | Bilibili]
Stella Yu (University of Michigan)

Title: Learning Mid-Level Vision from Nothing but Data
Abstract
Computer vision with deep learning has achieved super-human performance on various benchmarks. However, deep neural network models are highly specialized for the task and the data they are trained on. In contrast, babies with normal vision eventually all learn to see from their widely different visual experiences. I attribute this fascinating development of universal visual perception to the ability of learning mid-level visual representations from data without any external supervision. I will present our recent work on unsupervised learning of visual recognition from unlabeled videos and images, demonstrating that structures in the visual data can be discovered from nothing but data with minimal priors and model bottlenecks.

Accepted Papers

Computer Vision in the Wild Challenges

    There are two challenges associated with this workshop: "Image Classification in the Wild" (ICinW) and "Object Detection in the Wild" (ODinW). We summarize their evaluation datasets and metrics in the table below.

    Challenge    Eval Datasets                        Eval Metrics
    ICinW        20 image classification datasets     Zero-, few-, and full-shot
    ODinW        35 object detection datasets         Zero-, few-, and full-shot

    Submissions for each challenge are made via the corresponding "Make a Challenge Submission" page.
    To prevent a pure race in pre-training data and model size, we will have two tracks.
  • For the academic track, pre-training data is limited to: (1) ICinW: ImageNet-21K (excluding ImageNet-1K), CC3M+CC12M, and YFCC15M; (2) ODinW: Objects365.
  • For the industry track, there is no limitation on pre-training data or model size. Teams are required to disclose meta information about the model and data if extra data is used. Some publicly available image-text datasets: (1) the FLAVA Public Multimodal Datasets (PMD) corpus with 70M pairs; (2) LAION with 400M or 5B pairs.

  • Please see the submission pages for the detailed requirements of each Challenge -> Track -> Phase. More information about the challenge benchmark is available: [Benchmark] [Document] [Data Download]. Please reach out if you have any issues with your submissions.

    Please add a link to your team so that others can associate the submissions on the leaderboard with the corresponding team and work.
    • Click "Participation Teams" on the left side-bar
    • Find your participation team for the submission and click edit (the pencil icon)
    • Update "Team URL (Optional)" with your paper ArXiv link, GitHub page, etc.


July 25, 2022 Competition starts, testing phase begins
October 14, 2022 Competition ends (challenge paper submission)
October 14, 2022 (Extended) Workshop paper submission deadline
October 17, 2022 Workshop paper acceptance decision to authors
October 20, 2022 Camera-ready submission deadline

Workshop Organizers

Chunyuan Li

Jyoti Aneja

Jianwei Yang

Xin Wang

Pengchuan Zhang
Meta AI

Haotian Liu
UW Madison

Haotian Zhang
University of Washington

Liunian Li

Aishwarya Kamath

Challenge Organizers

Yinfei Yang

Yi-Ting Chen

Ye Xia

Yangguang Li

Feng Liang
UT Austin

Yufeng Cui

Ping Jin

Shohei Ono

Houwen Peng

Saining Xie

Han Hu

Amanpreet Singh

Xiaojie Jin

Jiashi Feng

Junyang Lin

An Yang

Peng Wang

Nguyen Bach

Yuning Lu

Yuanhan Zhang

Kaiyang Zhou

Ziwei Liu

Shilong Liu
Tsinghua University

Feng Li

Hao Zhang

Jianfeng Wang

Lijuan Wang

Xuehai He

Xin Eric Wang

Changyou Chen
University at Buffalo, SUNY

Yi Xu

Haoxuan You
Columbia University

Advisory Committee

Trevor Darrell
UC Berkeley

Lei Zhang

Yong Jae Lee
UW Madison

Houdong Hu

Zicheng Liu

Ce Liu

Xuedong Huang

Kai-Wei Chang

Jingdong Wang

Zhuowen Tu

Jianfeng Gao

Jenq-Neng Hwang
University of Washington

Yann LeCun

Disclaimer: To ensure fair comparisons in the challenge, the evaluation server and leaderboards are independently developed and maintained by the Workshop Organizers, while the Challenge Organizers actively promote and contribute to the competitions.

Workshop and Challenge Questions?
Reach out:
Workshop Organizing Team