Workshop on Computer Vision in the Wild
@ ECCV 2022, October 23 Virtual Meeting (Video List on YouTube and Bilibili)
9:00am-6:00pm Israeli Time || 11:00pm (October 22)-8:00am Pacific Time || 2:00pm-11:00pm Beijing Time

Overview

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept.

Recent works show that learning from large-scale image-text data is a promising approach to building transferable visual models that can effortlessly adapt to a wide range of downstream computer vision (CV) and multimodal (MM) tasks. Examples include CLIP, ALIGN and Florence for image classification, and ViLD, RegionCLIP and GLIP for object detection. These vision models with a language interface are naturally open-vocabulary recognition models, showing superior zero-shot and few-shot adaptation performance in various real-world scenarios.
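To make the idea of open-vocabulary recognition through a language interface concrete, here is a minimal sketch in Python. It is not the implementation of CLIP or any other model named above. It assumes that an image encoder and a text encoder have already mapped inputs into a shared embedding space, and it stands in for their outputs with hand-written toy vectors. The function names `cosine_similarity` and `zero_shot_classify` and all embedding values are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, text_embeddings):
    """Pick the class whose text embedding is closest to the image embedding.

    text_embeddings maps a class name to the (hypothetical) embedding of a
    text prompt describing that class.
    """
    scores = {
        name: cosine_similarity(image_embedding, embedding)
        for name, embedding in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for encoder outputs (assumed values, not real model outputs).
toy_text_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
label, scores = zero_shot_classify([0.9, 0.1, 0.0], toy_text_embeddings)
print(label)  # cat

# A new concept is added by supplying one more text embedding;
# no classifier weights are retrained.
toy_text_embeddings["bird"] = [0.6, 0.8, 0.0]
new_label, _ = zero_shot_classify([0.5, 0.85, 0.0], toy_text_embeddings)
print(new_label)  # bird
```

The point of the sketch is the second call: the label set is just a dictionary of text embeddings, so extending the vocabulary requires no additional labeled images, which is the property the paragraph above describes.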

We propose this "Computer Vision in the Wild (CVinW)" workshop, aiming to gather academic and industry communities to work on CV problems in real-world scenarios, focusing on the challenge of open-set/domain visual recognition and efficient task-level transfer. Since there are no established benchmarks to measure the progress of "CV in the Wild", we develop new benchmarks for image classification and object detection to measure the task-level transfer ability of various models/methods over diverse real-world datasets, in terms of both prediction accuracy and adaptation efficiency. This workshop will also host two challenges based on the ELEVATER benchmarks.

For those who are new to this topic, please check out the CVinW Reading List.

The 2nd CVinW Workshop & Challenge at CVPR 2023
Related Call for Papers (Deadline: April 1, 2023): International Journal of Computer Vision (IJCV) special issue on "Promises and Dangers of Large Vision Models"

Invited Speakers

Kate Saenko
Boston University

Stella Yu
University of Michigan

Ishan Misra
Meta

Yin Cui
Google

Yongqin Xian
Google

Xiaolong Wang
UC San Diego

Matthias Minderer
Google

Xiaohua Zhai
Google

Neil Houlsby
Google

Program (Video List on YouTube and Bilibili)

9 AM - 9:15 AM IDT (11 PM - 11:15 PM PT - 10/22)		Welcome [YouTube | Bilibili] Jianfeng Gao - Microsoft Research
9:15 AM - 9:45 AM IDT (11:15 PM - 11:45 PM PT - 10/22)		Invited Talk [YouTube | Bilibili] Xiaolong Wang (UC San Diego) Title: Robot Perception in the Wild Abstract: Visual representation learning has achieved tremendous success in semantic and geometric understanding. But how can this knowledge help robots interact with the physical world and operate in the wild? In this talk, I will introduce our studies on learning rich and generalizable visual representations for such a purpose. Specifically, I will talk about our work on learning open vocabulary semantic representations with only text supervision, and 3D object representations from videos in the wild using self-supervision. At the end of the talk, I will briefly show demos of how these techniques can help improve real-world robotics tasks including dexterous hand manipulation and legged robot locomotion control.
9:45 AM - 10:15 AM IDT (11:45 PM 10/22 - 12:15 AM PT)		Invited Talk: Google Brain Zurich Team Tutorial (1/3) [YouTube | Bilibili] Neil Houlsby (Google) Title: Architectures Beyond CNNs and Visual Scaling Laws Abstract: I will present some of our work that has explored the capabilities of non-convolutional architectures for Computer Vision, such as Transformers, Mixers, and Mixture-of-expert based models. These architectures often demonstrate favourable properties in the context of transfer learning from a large source dataset to small target datasets. In this context, I will discuss our exploration into these models' scaling laws, improved scaling law estimators, and the apparent saturation of larger vision models.
10:15 AM - 10:45 AM IDT (12:15 AM - 12:45 AM PT)		Invited Talk: Google Brain Zurich Team Tutorial (2/3) [YouTube | Bilibili] Xiaohua Zhai (Google) Title: Scaling Vision and Language Learning with Vision Transformers Abstract: Attention-based neural networks such as the Vision Transformer (ViT) have recently achieved state-of-the-art results on many computer vision benchmarks. Scale is a primary ingredient in obtaining excellent results. In this talk, I will first present how to train a text model to "read out" good representations from a pre-trained and locked Vision Transformer model for new tasks, named "Locked-image Tuning" (LiT). A LiT model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval. Then I will discuss how to use the mixture of experts model (LIMoE) for contrastive learning, which accepts both images and text simultaneously. Finally, I will share how to reuse pre-trained ViT models and pre-trained encoder-decoder language models in PaLI, to support vision and language downstream tasks such as captioning and VQA across over 100 languages. When scaling up the PaLI visual component from ViT-G to ViT-e, we observed a significant boost on vision and language tasks.
10:45 AM - 11:15 AM IDT (12:45 AM - 1:15 AM PT)		Invited Talk: Google Brain Zurich Team Tutorial (3/3) [YouTube | Bilibili] Matthias Minderer (Google) Title: Beyond image-level tasks: Scalable approaches to localization and dense prediction Abstract: Scaling of vision-language models has been very successful for image-level tasks such as classification and retrieval. I will present work on extending this approach beyond the image level, to structured tasks such as detection, segmentation, depth estimation, and colorization. I will first discuss a simple architecture and recipe for transferring vision-language models to open-vocabulary object detection (OWL-ViT). Second, I will discuss a general image modeling approach which combines a language and an image model to unify many dense prediction tasks (UViM). In this approach, the language model learns to represent structured, interdependent data features, while the image model efficiently deals with high-dimensional pixel-wise output. These methods are competitive with the respective state-of-the-art methods, while being simpler and more general.
11:15 AM - 12 PM IDT (1:15 AM - 2 AM PT)		Challenge Summary on ICinW and ODinW [YouTube | Bilibili] Chunyuan Li (Microsoft Research) [Slides]
Winner team presentations (Decided by October 20, 2022) [YouTube | Bilibili]
- ICinW Industry Track | Chinese CLIP | Junyang Lin (Alibaba)
- ICinW Academic Track | K-LITE | Sheng Shen (University of California, Berkeley)
- ICinW ImageNet-1K in Pre-training | Bamboo | Yuanhan Zhang (Nanyang Technological University)
- ICinW Parameter-Efficiency | ProDA | Yuning Lu (University of Science and Technology of China)
- ODinW Zero-Shot Track | DetCLIP | Jianhua Han (Huawei)
- ODinW Full-Shot Track | DINO | Shilong Liu (IDEA & Tsinghua), Hao Zhang (IDEA & HKUST)
12 PM - 1:30 PM IDT (2 AM - 3:30 AM PT)	Lunch Break
1:30 PM - 2 PM IDT (3:30 AM - 4 AM PT)		Invited Talk [YouTube | Bilibili] Title: General purpose visual recognition systems: beyond a single modality and a task Ishan Misra (Meta) Abstract: Modern computer vision models are good at specialized tasks. Given the right architecture and the right supervision, supervised learning can yield great specialist models. However, specialist models also have severe limitations: they can only do what they are trained for and require copious amounts of pristine supervision for it. In this talk, I’ll focus on two limitations: specialist models cannot work on tasks beyond what they saw training labels for, or on new types of visual data. I’ll present our recent efforts that design better architectures, training paradigms and loss functions to address these issues. Our first line of work, called Omnivore, presents a single model that can operate on images, videos, and single-view 3D data. Omnivore leads to shared representations across visual modalities, without using paired input data. Omnivore can also be trained in a self-supervised manner. I'll conclude the talk with general purpose detection and segmentation models. We developed Detic, a simple way to train large-vocabulary detectors using image-level labels which leads to a 20,000+ class detector. We also proposed Mask2Former which is a single meta architecture for all types of image and video segmentation tasks.
2 PM - 2:30 PM IDT (4 AM - 4:30 AM PT)		Invited Talk [YouTube | Bilibili] Title: Learning Unsupervised Semantic Embeddings for Zero-Shot Image Classification Yongqin Xian (Google) Abstract: Despite the tremendous progress in zero-shot learning (ZSL), the majority of existing methods still rely on human-annotated attributes, which are difficult to annotate and scale. An unsupervised alternative is to represent each class using the word embedding associated with its semantic class name. However, word embeddings extracted from pre-trained language models do not necessarily capture visual similarities, resulting in poor zero-shot performance. In this talk, I will first present visually-grounded semantic embedding (VGSE) that enhances the word embeddings by mapping them into a latent space learned by image regions clustering. In the second part, I will introduce Image to Document Transformer (I2DFormer), a new transformer-based ZSL framework that learns unsupervised semantic embeddings from images and class-level online textual documents, e.g., Wikipedia. We empirically show that both methods significantly outperform previous unsupervised semantic embeddings on three public datasets and lead to highly interpretable results.
2:30 PM - 3:45 PM IDT (4:30 AM - 5:45 AM PT)		Spotlight Paper Presentations [YouTube | Bilibili]
- MS-CLIP | Haoxuan You (Columbia University)
- CLIP understands Texture? | Chenyun Wu (UMass Amherst)
- Synthetic Data for Infrequent OD | Ninad Kulkarni (Amazon Web Services)
- XMem | Ho Kei Cheng (University of Illinois Urbana-Champaign)
- AUGCO | Viraj Prabhu (Georgia Tech)
- Unsupervised Selective Labeling | Long Lian (UC Berkeley)
- Diffusion Models for Outfit Rendering | Vignesh Srinivasan (Zalando Research)
- OmDet | Tiancheng Zhao (Binjiang Institute of Zhejiang University)
3:45 PM - 4:30 PM IDT (5:45 AM - 6:30 AM PT)	Afternoon Break
4:30 PM - 5 PM IDT (6:30 AM - 7 AM PT)		Invited Talk [YouTube | Bilibili] Kate Saenko (Boston University) Title: Data Shift Happens, What To Do About It? Abstract: In computer vision, generalization of learned representations is usually measured on i.i.d. data. This hides the fact that models often struggle to generalize to non-i.i.d. data and fail to overcome the biases inherent in visual datasets. Labeling additional data in each new situation is the standard solution but is often prohibitively expensive. I will discuss some recent work in my lab addressing the core challenges in overcoming dataset bias, including adaptation to natural domain shifts, sim2real transfer, avoiding spurious correlations, and the role of pretraining in generalizability.
5 PM - 5:30 PM IDT (7 AM - 7:30 AM PT)		Invited Talk [YouTube | Bilibili] Yin Cui (Google) Title: Open-Vocabulary Visual Perception upon Frozen Vision and Language Models [Slides] Abstract: Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs has become a promising paradigm for open-vocabulary visual perception. In our recent explorations, we developed open-vocabulary models for detection based on distilling VLMs on existing detection data (ViLD), and for segmentation based on aligning image regions with image captions (OpenSeg). In this talk, I will focus on how to greatly simplify the paradigm by directly building upon frozen VLMs like CLIP with minimal modifications. In the first part, I will present our open-vocabulary detection model F-VLM that achieves state-of-the-art performance on the LVIS benchmark by only training a light-weight detector head. In the second part, I will show how we leverage motion and audio to help video generalize better to novel classes. Our model MOV encodes video, audio and flow with the same pre-trained CLIP’s vision encoder (frozen for video). We design an asymmetrical cross-attention module to aggregate multimodal information. MOV achieves state-of-the-art performance on UCF and HMDB, outperforming both traditional zero-shot methods and recent CLIP-based adaptation methods.
5:30 PM - 6 PM IDT (7:30 AM - 8 AM PT)		Invited Talk [YouTube | Bilibili] Stella Yu (University of Michigan) Title: Learning Mid-Level Vision from Nothing but Data Abstract: Computer vision with deep learning has achieved super-human performance on various benchmarks. However, deep neural network models are highly specialized for the task and the data they are trained on. In contrast, babies with normal vision eventually all learn to see from their widely different visual experiences. I attribute this fascinating development of universal visual perception to the ability of learning mid-level visual representations from data without any external supervision. I will present our recent work on unsupervised learning of visual recognition from unlabeled videos and images, demonstrating that structures in the visual data can be discovered from nothing but data with minimal priors and model bottlenecks.

Accepted Papers

How well does CLIP understand texture? Chenyun Wu, Subhransu Maji
XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model Ho Kei Cheng, Alexander Schwing
Matryoshka Representations for Adaptive Deployment Aniket Rege, Aditya Kusupati, Gantavya Bhatt, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, Ali Farhadi
Visual - Semantic Contrastive Alignment for Few-Shot Image Classification Mohamed Afham Mohamed Aflal, Ranga Rodrigo
AUGCO: Augmentation Consistency-guided Self-training for Source-free Domain Adaptive Semantic Segmentation Viraj Prabhu, Shivam Khare, Deeksha Kartik, Judy Hoffman
Unsupervised Selective Labeling for More Effective Semi-Supervised Learning Xudong Wang, Long Lian, Stella X Yu
OmDet: Language-Aware Object Detection with Large-scale Vision-Language Multi-dataset Pre-training Tiancheng Zhao, Peng Liu, Xiaopeng Lu, Qianqian Zhang, Kyusong Lee, Tianqi Zhang, Mingwei Zhu, Haozhan Shen
Diffusion Models for Outfit Rendering: Novel Conditioning Architectures for Subject-driven Generation Vignesh Srinivasan, Nikolay Jetchev, Martin Heusel, Tofigh Naghibi
Domain-Compatible Synthetic Data Generation for Infrequent Objects Detection Negin Sokhandan, Ninad D Kulkarni, Yash Shah, Suchitra Sathyanarayana
MS-CLIP: Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training Haoxuan You*, Luowei Zhou*, Bin Xiao*, Noel Codella*, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan

Computer Vision in the Wild Challenges

Challenge	Eval Datasets	Eval Metrics	Instructions	Make a Challenge Submission
ICinW	20 Image Classification Datasets	Zero, few, full-shot
ODinW	35 Object Detection Datasets	Zero, few, full-shot

For the academic track, pre-training data is limited: (1) ICinW: ImageNet21K (Removing ImageNet1K), CC3M+CC12M, YFCC15M; (2) ODinW: Objects365.
For the industry track, there is no limitation on pre-training data and model size. Teams are required to disclose meta info of model and data if extra data is used. Here are some publicly available image-text datasets: (1) FLAVA Public Multimodal Datasets (PMD) corpus with 70M pairs; (2) LAION with 400M or 5B pairs.

[Benchmark]

[Document]

[Data Download]

Please add a link to your participant team so that others can associate the submissions on the leaderboard with their corresponding team and work.

Click "Participation Teams" on the left side-bar
Find your participation team for the submission, click edit (the pencil icon)
Update "Team URL (Optional)" with your paper ArXiv link, GitHub page, etc.

Dates

July 25, 2022: Competition starts, testing phase begins
October 14, 2022: Competition ends (challenge paper submission)
October 14, 2022 (Extended): Workshop paper submission deadline
October 17, 2022: Workshop paper acceptance decision to authors
October 20, 2022: Camera-ready submission deadline

Workshop Organizers

Chunyuan Li
Microsoft

Jyoti Aneja
Microsoft

Jianwei Yang
Microsoft

Xin Wang
Microsoft

Pengchuan Zhang
Meta AI

Haotian Liu
UW Madison

Haotian Zhang
University of Washington

Liunian Li
UCLA

Aishwarya Kamath
NYU

Challenge Organizers

Yinfei Yang
Apple

Yi-Ting Chen
Google

Ye Xia
Google

Yangguang Li
SenseTime

Feng Liang
UT Austin

Yufeng Cui
SenseTime

Ping Jin
Microsoft

Shohei Ono
Microsoft

Houwen Peng
Microsoft

Saining Xie
NYU/Meta

Han Hu
Microsoft

Amanpreet Singh
HuggingFace

Xiaojie Jin
Bytedance

Jiashi Feng
Bytedance

Junyang Lin
Alibaba

An Yang
Alibaba

Peng Wang
Alibaba

Nguyen Bach
Microsoft

Yuning Lu
USTC

Yuanhan Zhang
NTU

Kaiyang Zhou
NTU

Ziwei Liu
NTU

Shilong Liu
Tsinghua University

Feng Li
HKUST

Hao Zhang
HKUST

Jianfeng Wang
Microsoft

Lijuan Wang
Microsoft

Xuehai He
UCSC

Xin Eric Wang
UCSC

Changyou Chen
University at Buffalo, SUNY

Yi Xu
Amazon

Haoxuan You
Columbia University

Advisory Committee

Trevor Darrell
UC Berkeley

Lei Zhang
IDEA

Yong Jae Lee
UW Madison

Houdong Hu
Microsoft

Zicheng Liu
Microsoft

Ce Liu
Microsoft

Xuedong Huang
Microsoft

Kai-Wei Chang
UCLA

Jingdong Wang
Baidu

Zhuowen Tu
UCSD

Jianfeng Gao
Microsoft

Jenq-Neng Hwang
University of Washington

Yann LeCun
NYU/Meta

Disclaimer: To ensure fair comparisons in the challenge, the evaluation server and leaderboards are independently developed and maintained by the Workshop Organizers, while the Challenge Organizers actively promote and contribute to the competitions.

Workshop and Challenge Questions?
Reach out: https://github.com/Computer-Vision-in-the-Wild/eccv-2022
Workshop Organizing Team