The 4th Workshop on Computer Vision in the Wild
Theme: Building Multimodal AI with both Verbal and Spatial Intelligence
Date: June 11 or 12 | Location: Music City Center, Nashville, TN
Overview
As artificial intelligence continues to evolve, the intersection of vision and language models is becoming increasingly crucial for real-world applications. The 4th Workshop on Computer Vision in the Wild (CVinW) at CVPR 2025 aims to foster discussions and innovations that push the boundaries of computer vision systems in unconstrained environments. Building on the success of our previous workshops: CVPR 2024 CVinW Workshop, CVPR 2023 CVinW Workshop and ECCV 2022 CVinW Workshop, this edition will focus on the next generation of large multimodal models (LMMs) and vision-language-action (VLA) systems, with an emphasis on temporal reasoning, video understanding, and physical interaction.

Image source: Vision-Language Pre-training: Basics, Recent Advances, and Future Trends and Multimodal Foundation Models: From Specialists to General-Purpose Assistants
Over the past few years, we have witnessed remarkable advances in open-vocabulary visual comprehension and multimodal learning. Five years ago, vision-language (multimodal) models were mostly built on top of the BERT architecture. These models typically contained fewer than 1B parameters and were trained on relatively small image-text corpora; representative works include ViLBERT, UNITER, and VisualBERT. They were mostly used for image-text understanding tasks such as visual question answering (VQA) and image captioning. Later, multimodal vision foundation models such as CLIP, ALIGN and Florence emerged, scaling multimodal training to billions of image-text pairs. Although their model sizes remained relatively modest, they showed strong open-vocabulary and zero-shot recognition capabilities across a wide range of visual domains. These capabilities were further transferred to fine-grained core vision tasks such as object detection (e.g., M-DETR, ViLD, GLIP, RegionCLIP, GroundingDINO, OWL-ViT) and image segmentation (e.g., X-Decoder, SegGPT, SEEM, SAM, LISA). Most recently, we have entered the era of large multimodal models: connecting multimodal vision encoders such as CLIP with large language models, as in Flamingo, Gemini, and GPT-4V, has led to many advanced multimodal capabilities. Today we have multimodal chatbots such as GPT-4o, LLaVA-OneVision, Qwen-2.5-VL and Phi-4-Multimodal that can see, talk, and reason.
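To make the open-vocabulary, zero-shot recognition ability described above concrete, here is a minimal sketch of CLIP-style zero-shot classification using the Hugging Face transformers library. The checkpoint name, image URL, and candidate label set are illustrative assumptions, not part of the workshop material; any text labels can be swapped in at inference time, which is what makes the vocabulary "open".

```python
# Minimal sketch: CLIP-style zero-shot, open-vocabulary classification.
# The checkpoint, image URL, and labels below are illustrative assumptions.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load any image; the label set is free-form text chosen at inference time.
image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a traffic light"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, softmaxed over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```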
Despite these successes, current vision models still lack the ability to fully grasp temporal dynamics, causal reasoning, and embodied interaction, which are key elements for autonomous agents that can see, reason, and act. Some recent works have attempted to address these challenges by building agentic models and VLA models. Our workshop aims to bring together leading researchers, practitioners, and industry experts to discuss these emerging trends, challenges, and solutions in the field.
Highlights
(1) Invited Talks from leading experts in academia and industry on the latest advancements in multimodal AI.
(2) Paper Presentations showcasing cutting-edge research contributions in computer vision in the wild.
(3) Panel Discussions (Tentative) exploring the future of vision-language-action models and their impact on robotics, autonomous systems, and real-world AI applications.
We invite researchers, engineers, and enthusiasts to join us in shaping the future of vision systems that go beyond static image recognition to dynamic, interactive, and real-world AI applications. Stay tuned for more details on speakers, paper submissions, and challenge participation!
For more information, visit our official workshop page and explore our CVinW reading list: 📌 CVinW Readings
Invited Speakers
Call for Papers
We welcome original contributions that advance the state of the art in vision-language learning, multimodal perception, and embodied AI, particularly in unconstrained, real-world environments. Topics of interest include, but are not limited to:
- LMMs & Vision-Language Systems: Open-vocabulary learning, multimodal pretraining, and adaptation.
- Video Understanding & Temporal Reasoning: Long-range video modeling, causal reasoning, and instruction-following.
- VLA & Embodied AI: Multimodal action learning, simulation-to-real transfer, and robotic perception.
- Foundation Models for Vision Tasks: Object detection, segmentation, tracking, and fine-grained recognition in the wild.
- Efficient Training Methods: Adaptation methods for large vision models, measured by #training samples (zero-shot and few-shot), #trainable parameters, throughput, and training cost.
- New Metrics and Benchmarks: Novel ways to evaluate existing LMMs and large vision models for task-level transfer and open-set visual recognition.
We accept abstract submissions to our workshop. Submissions may be up to 8 pages (excluding references) and must follow the CVPR 2025 author guidelines. All submissions will be reviewed by the Program Committee on the basis of technical quality, relevance to the scope of the workshop, originality, significance, and clarity. The review process is double-blind, and accepted papers are NOT archived in the CVPR 2025 proceedings.
Workshop Paper Submission Portal:
[Open Review]
Submission Deadline:
May 2nd, 2025
Acceptance Notification:
May 16th, 2025
Camera-ready Submission:
May 30th, 2025
Questions? Reach out to the Workshop Organizing Team.