The 5th Workshop on Computer Vision in the Wild
Theme: Building Multimodal AI Agents with Verbal, Spatial and Temporal Intelligence
Date: June 3-4, 2026 | Location: Denver Convention Center, Denver, CO
Overview
As artificial intelligence continues to evolve, the intersection of vision, language, and action is becoming central to systems that must operate reliably in unconstrained, real-world settings. The 5th Workshop on Computer Vision in the Wild (CVinW) at CVPR 2026 aims to bring together researchers and practitioners advancing multimodal AI agents that can perceive, reason, and act across digital and physical environments, while highlighting the capabilities where today’s models still fall short. Building on the success of our previous workshops (the CVPR 2025, CVPR 2024, and CVPR 2023 CVinW Workshops and the ECCV 2022 CVinW Workshop), this year’s edition focuses on the intersection of large multimodal models (LMMs) and vision-language-action (VLA) systems, with particular emphasis on fine-grained spatiotemporal reasoning, causal inference, long-horizon planning and memory, and robust tool use. The goal is to move beyond static understanding toward agents that perceive, reason, and act in dynamic environments, from interactive digital settings to embodied physical interaction.
Image source: Vision-Language Pre-training: Basics, Recent Advances, and Future Trends and Multimodal Foundation Models: From Specialists to General-Purpose Assistants
Over the past years, we have witnessed remarkable advances in open-vocabulary visual comprehension and multimodal learning. Five years ago, vision-language (multimodal) models were mostly built on top of the BERT architecture. These models typically contained fewer than 1B parameters and were trained on a relatively small number of images; representative works include ViLBERT, UNITER, and VisualBERT. They were mostly used for image-text tasks such as visual question answering (VQA) and image captioning. Later, multimodal vision foundation models such as CLIP, ALIGN, and Florence emerged, scaling multimodal training to billions of images. Although still relatively small in model size, they show strong open-vocabulary and zero-shot recognition capabilities across a wide range of visual domains. These capabilities have since been transferred to fine-grained core vision tasks such as object detection (e.g., M-DETR, ViLD, GLIP, RegionCLIP, GroundingDINO, OWL-ViT) and image segmentation (e.g., X-Decoder, SegGPT, SEEM, SAM, LISA). Most recently, we entered the era of large multimodal models: connecting multimodal vision models such as CLIP with large language models has led to systems like Flamingo, Gemini, and GPT-4V, with many advanced multimodal capabilities. Today, multimodal chatbots such as GPT-4o, LLaVA-OneVision, Qwen2.5-VL, and Phi-4-Multimodal can see, talk, and reason.
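As a concrete illustration of the open-vocabulary, zero-shot recognition described above, the minimal sketch below scores a single image against free-form text labels with a CLIP-style model. It assumes the Hugging Face transformers CLIP API; the checkpoint name, label set, and image path are illustrative placeholders rather than a setup prescribed by the workshop.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; any CLIP-style image-text model with the same API works.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Free-form labels: no fixed class vocabulary is baked into the model.
labels = ["a photo of a dog", "a photo of a cat", "a photo of a bicycle"]
image = Image.open("example.jpg")  # placeholder path for a local test image

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0]):
    print(f"{label}: {p.item():.3f}")
```

Because the labels are ordinary text prompts, new categories can be added without retraining, which is the property that the open-vocabulary detection and segmentation models listed above build on.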
Despite these successes, current vision models still lack the ability to fully grasp temporal dynamics, causal reasoning, and embodied interaction: key elements for autonomous agents that can see, reason, and act. Recent works have begun to address these challenges by building agentic models and VLA models. Our workshop aims to bring together leading researchers, practitioners, and industry experts to discuss these emerging trends, challenges, and solutions.
Highlights
(1) Invited Talks from leading experts in academia and industry on the latest advancements in multimodal AI.
(2) Paper Presentations showcasing cutting-edge research contributions in computer vision in the wild.
(3) Panel Discussions (Tentative) exploring the future of vision-language-action models and their impact on robotics, autonomous systems, and real-world AI applications.
We invite researchers, engineers, and enthusiasts to join us in shaping the future of vision systems that go beyond static image recognition to dynamic, interactive, and real-world AI applications. Stay tuned for more details on speakers, paper submissions, and challenge participation!
For more information, visit our official workshop page and explore our CVinW reading list: 📌 CVinW Readings
Invited Speakers
Manling Li, Chelsea Finn, Xiaolong Wang, Mohit Bansal, and Kate Saenko (see the schedule below).
Tentative Schedule (June 3-4, Wednesday - Thursday)
- Welcome: Reuben Tan (Bio: TODO)
- Invited Talk: Manling Li (Title: TODO; Abstract: TODO; Bio: TODO)
- Invited Talk: Chelsea Finn (Title: TODO; Abstract: TODO; Bio: TODO)
- Workshop Paper Presentations
- Afternoon Break and Poster Session (Poster boards: TODO)
- Invited Talk: Xiaolong Wang (Title: TODO; Abstract: TODO; Bio: TODO)
- Invited Talk: Mohit Bansal (Title: TODO; Abstract: TODO; Bio: TODO)
- Invited Talk: Kate Saenko (Title: TODO; Abstract: TODO; Bio: TODO)
- Panel Discussion + Closing Remarks (Moderator: Jianfeng Gao; Bio: TODO)
Call for Papers
We welcome original contributions that advance the state of the art in vision-language learning, multimodal perception, and embodied AI, particularly in unconstrained, real-world environments. Topics of interest include, but are not limited to:
- LMMs & Vision-Language Systems: Open-vocabulary learning, multimodal pretraining, and adaptation.
- Video Understanding & Temporal Reasoning: Long-range video modeling, causal reasoning, and instruction-following.
- VLA & Embodied AI: Multimodal action learning, simulation-to-real transfer, and robotic perception.
- Foundation Models for Vision Tasks: Object detection, segmentation, tracking, and fine-grained recognition in the wild.
- Efficient Training Methods: Large visual model adaptation methods, measured by #training samples (zero-shot and few-shot), #trainable parameters, throughput, and training cost; see the sketch after this list.
- New Metrics and Benchmarks: Novel ways to evaluate existing LMMs and large vision models for task-level transfer and open-set visual recognition.
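To make the efficiency axes above concrete, here is a minimal sketch of the #trainable-parameters metric under a simple linear-probing setup: a pretrained backbone is frozen and only a small task head is trained. The torchvision backbone and the 20-class head are illustrative assumptions, not a required baseline.

```python
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

# Illustrative pretrained backbone (any frozen visual encoder would do).
backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
for p in backbone.parameters():
    p.requires_grad = False  # freeze all pretrained weights

# Replace the classifier with a new, trainable 20-class head (hypothetical task).
backbone.fc = nn.Linear(backbone.fc.in_features, 20)

trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
total = sum(p.numel() for p in backbone.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```

Reporting the trainable fraction alongside accuracy is one simple way to compare adaptation methods on the efficiency axes listed above.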
We accept abstract submissions to our workshop. Submissions may be at most 8 pages (excluding references) and must follow the CVPR 2026 author guidelines. All submissions will be reviewed by the Program Committee on the basis of technical quality, relevance to the scope of the workshop, originality, significance, and clarity. The review process is double-blind, and accepted papers are NOT archived in the CVPR 2026 proceedings.
Workshop Paper Submission Portal: [OpenReview]
Submission Deadline: May 16th, 2026
Acceptance Notification: May 23rd, 2026
Camera-ready Submission: May 30th, 2026
Call for Challenge Submissions
We introduce two new challenges to evaluate the performance of large vision models in the wild:
Competition ends: June 1st, 2026
Invitation to present at workshop: June 6th, 2026
