ECCV 2024

European Conference on Computer Vision | Sept. 29 – Oct. 4, 2024

The European Conference on Computer Vision (ECCV) is a premier biennial research conference in computer vision and machine learning. It is held in even-numbered years and brings together the scientific and industrial communities working in these areas.

Discover Georgia Tech’s experts and their solutions in advancing computer vision in an era of rapidly evolving computing technologies.

Computer vision research allows machines to interpret and analyze visual information, creating advancements in autonomous vehicles, medical imaging, security applications, and many sectors of the modern economy. Meet the Georgia Tech experts charting a path forward.

#ECCV2024


Georgia Tech at ECCV 2024

Partner Organizations

Beijing Jiaotong University • Carnegie Mellon University • Cisco Research • Emory University • Georgia Tech • Google • Hong Kong University of Science and Technology • IIT Roorkee • Intel Labs • Meta • Stack AV • State University of New York at Buffalo • The University of Hong Kong • Toyota Research Institute • University of Illinois Urbana-Champaign • University of Michigan • University of Pennsylvania • University of Waterloo • University of Wisconsin Madison

Faculty with number of papers


FEATURED

BIRD’S EYE VIEW: SkyScenes Dataset Could Lead to Safe, Reliable Autonomous Flying Vehicles

By Nathan Deen

Is it a building or a street? How tall is the building? Are power lines nearby?

These are details autonomous flying vehicles need to know to function safely. However, few aerial image datasets exist that can adequately train the computer vision algorithms that would pilot these vehicles.

That’s why Georgia Tech researchers created a new benchmark dataset of computer-generated aerial images.

Judy Hoffman, an assistant professor in Georgia Tech’s School of Interactive Computing, worked with students in her lab to create SkyScenes. The dataset contains over 33,000 aerial images of cities curated from a computer simulation program.

Distributed Data Intensive Systems Lab

New Technique Empowers Users to Combat Unauthorized Facial Recognition by ‘Masking’ Images Before Sharing

• • •

Face recognition (FR) through networked systems can be abused for privacy intrusion. Governments, private companies, or even individual attackers can collect facial images by scraping the web to build an FR system identifying human faces without a person’s consent.

New research from Georgia Tech and the University of Hong Kong introduces Chameleon, which learns to generate a user-centric personalized privacy protection mask, coined a P3-Mask, to protect facial images against unauthorized facial recognition.

The team used cross-image optimization to generate one P3-Mask that is suitable to apply to all images of a single person’s face. This enables efficient and instant protection even for users with limited computing resources.

The team incorporated “perceptibility optimization” to preserve the visual quality of the protected facial images. Researchers also strengthened the robustness of the mask by integrating a diversity of models into the mask generation process. Extensive experiments on two benchmark datasets show that Chameleon outperforms three state-of-the-art methods with instant protection and minimal degradation of image quality.
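
The snippet below is only an illustrative sketch of the final application step, assuming a mask has already been learned offline; the array names, clipping budget, and blending are placeholders rather than Chameleon’s actual implementation.

    import numpy as np

    def apply_p3_mask(image, mask, budget=8.0):
        """Add a person-specific perturbation mask to a photo before sharing.

        image: HxWx3 uint8 array; mask: HxWx3 float array learned offline.
        budget: illustrative L-infinity cap that keeps the edit hard to notice.
        """
        perturbation = np.clip(mask, -budget, budget)         # bound the visual impact
        protected = image.astype(np.float32) + perturbation   # the same mask is reused for every photo
        return np.clip(protected, 0, 255).astype(np.uint8)

    # One mask protects all photos of the same person, so application is instant.
    photos = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(3)]
    p3_mask = np.random.uniform(-8, 8, (224, 224, 3)).astype(np.float32)  # stand-in for a learned P3-Mask
    protected_photos = [apply_p3_mask(p, p3_mask) for p in photos]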

For non-technical users looking for online protection, Chameleon enables cost-effective FR authorization using the P3-Mask, and it demonstrates high resilience against adaptive adversaries.

This research proposes a technique called Chameleon to generate a personalized privacy protection ‘mask’ for images. Adding such a mask before publishing facial images can safeguard your photos against unauthorized facial recognition.

• • •

The P3-Mask can be applied to any facial images of the same person while preserving image quality.
The Chameleon team is one of four with faculty from the School of Computer Science at Georgia Tech. Team members, by author order, include alumnus Ka-Ho Chow, PhD CS 23, and current Tech researchers Sihao Hu, Tiansheng Huang, and Ling Liu.


Efficient and Intelligent Computing Lab

New 3D Reconstruction Method Could Pave the Way for Instant Scene Generation for Consumer and Professional Use

Ph.D. student Yonggan Fu

What It’s About: Our work, Omni-Recon, was accepted as an ECCV Oral Paper—the top 2 percent of papers—and draws inspiration from emerging foundational AI models. Omni-Recon aims to develop a general-purpose 3D reconstruction and understanding solution for instantly reconstructing any new 3D scene and handling diverse 3D understanding tasks in a “zero-shot manner,” where the AI has not been specifically trained or given examples for that exact task. It can also be adapted to various downstream 3D applications, such as real-time rendering and 3D scene editing.

Why generalizable and real-time 3D reconstruction matters in daily life: Imagine trying on clothes in a virtual fitting room, where a 3D model of your body is created in real-time to show how garments fit from every angle. Or consider home renovation: you could use your smartphone to visualize how new flooring or furniture would look in your space before making a purchase. In retail, augmented reality (AR) apps can help shoppers see products in their homes before buying. In healthcare, doctors could create 3D models of organs during surgery for better precision and diagnosis. Real-time 3D reconstruction enables these applications by making the virtual world feel more tangible and responsive, helping people make informed decisions and interact with their environment in novel ways.

How Omni-Recon is pushing 3D Reconstruction forward: Previous 3D models often require long wait times to reconstruct or understand a new scene, or rely on offline rendering without real-time interaction. Our work offers insights on how to achieve both simultaneously. Specifically, using a small set of 2D images captured by a camera, our method can instantly reconstruct the 3D scene, predict its geometry, understand its semantics, and support real-time rendering from novel viewpoints. This capability makes our framework beneficial for all the aforementioned 3D applications.

Yingyan “Celine” Lin

How the EIC Lab is advancing efficient 3D reconstruction: One of the research directions of the EIC Lab, directed by Assoc. Professor Yingyan “Celine” Lin in the School of Computer Science, is developing efficient 3D reconstruction solutions from both algorithmic and hardware perspectives. In addition to the lab’s ECCV 2024 Oral Paper, which presents an efficient 3D reconstruction algorithm, the group has also created advanced hardware architectures (accepted by MICRO’24, ISCA’23, DAC’23, and ICCAD’22) to enhance the training and inference efficiency of cutting-edge 3D reconstruction methods. Notably, one of these designs has been demonstrated through a taped-out chip, showcasing its practical impact. This work has been nominated as a best paper candidate at MICRO 2024, a first-tier computer architecture conference.

3×2: 3D Object Part Segmentation by 2D Semantic Correspondences
Anh Thai, Weiyao Wang, Hao Tang, Stefan Stojanov, James Rehg, Matt Feiszli

3D object part segmentation is essential in computer vision applications. While substantial progress has been made in 2D object part segmentation, the 3D counterpart has received less attention, in part due to the scarcity of annotated 3D datasets, which are expensive to collect. In this work, we propose to leverage a few annotated 3D shapes or richly annotated 2D datasets to perform 3D object part segmentation. We present our novel approach, termed 3-By-2, that achieves SOTA performance on different benchmarks with various granularity levels. By using features from pretrained foundation models and exploiting semantic and geometric correspondences, we are able to overcome the challenges of limited 3D annotations. Our approach leverages available 2D labels, enabling effective 3D object part segmentation. Our method 3-By-2 can accommodate various part taxonomies and granularities, demonstrating interesting part label transfer ability across different object categories. We will release code to the community.
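
At its core, the idea is to move part labels from richly annotated 2D data onto 3D points through feature correspondences. The sketch below is a generic nearest-neighbor label transfer in a shared feature space, shown only to illustrate the concept; it is not the paper’s matching procedure, and the tensor shapes are made up.

    import torch
    import torch.nn.functional as F

    def transfer_labels(points_feat, pixels_feat, pixel_labels):
        """Generic correspondence-based label transfer: each 3D point takes the
        part label of its nearest 2D pixel in a shared (e.g., foundation-model)
        feature space. A simplified stand-in for the matching used in 3-By-2."""
        pts = F.normalize(points_feat, dim=-1)    # (P, D) 3D point features
        pix = F.normalize(pixels_feat, dim=-1)    # (M, D) 2D pixel features
        nearest = (pts @ pix.t()).argmax(dim=-1)  # (P,) best-matching pixel per point
        return pixel_labels[nearest]

    labels_3d = transfer_labels(torch.randn(1000, 64), torch.randn(4096, 64),
                                torch.randint(0, 5, (4096,)))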


AnatoMask: Enhancing Medical Image Segmentation with Reconstruction-guided Self-masking
Yuheng Li, Tianyu Luan, Yizhou Wu, Shaoyan Pan, Yenho Chen, Xiaofeng Yang

Due to the scarcity of labeled data, self-supervised learning (SSL) has gained much attention in 3D medical image segmentation, by extracting semantic representations from unlabeled data. Among SSL strategies, masked image modeling (MIM) has shown effectiveness by reconstructing randomly masked images to learn detailed representations. However, conventional MIM methods require extensive training data to achieve good performance, which still poses a challenge for medical imaging. Since random masking uniformly samples all regions within medical images, it may overlook crucial anatomical regions and thus degrade the pretraining efficiency. We propose AnatoMask, a novel MIM method that leverages reconstruction loss to dynamically identify and mask out anatomically significant regions to improve pretraining efficacy. AnatoMask takes a self-distillation approach, where the model learns both how to find more significant regions to mask and how to reconstruct these masked regions. To avoid suboptimal learning, AnatoMask adjusts the pretraining difficulty progressively using a masking dynamics function. We have evaluated our method on 4 public datasets with multiple imaging modalities (CT, MRI and PET). AnatoMask demonstrates superior performance and scalability compared to existing SSL methods.
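
The paper’s masking dynamics function is not reproduced here; the toy schedule below only illustrates the general idea of ramping up masking difficulty as pretraining progresses (the ratios and the cosine shape are arbitrary choices).

    import math

    def masking_schedule(step, total_steps, start_ratio=0.3, end_ratio=0.75):
        """Illustrative easy-to-hard schedule: begin with light masking and
        progressively mask more (e.g., more of the loss-guided anatomical regions)."""
        progress = min(max(step / total_steps, 0.0), 1.0)
        # Cosine ramp from start_ratio up to end_ratio.
        return end_ratio - (end_ratio - start_ratio) * 0.5 * (1 + math.cos(math.pi * progress))

    for step in (0, 25_000, 50_000, 100_000):
        print(step, round(masking_schedule(step, 100_000), 3))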


Benchmarking Object Detectors with COCO: A New Path Forward
Shweta Singh, Aayan Yadav, Jitesh Jain, Humphrey Shi, Justin Johnson, Karan Desai

The Common Objects in Context (COCO) dataset has been instrumental in benchmarking object detectors over the past decade. Like every dataset, COCO contains subtle errors and imperfections stemming from its annotation procedure. With the advent of high-performing models, we ask whether these errors of COCO are hindering its utility in reliably benchmarking further progress. In search for an answer, we inspect thousands of masks from COCO (2017 version) and uncover different types of errors such as imprecise mask boundaries, non-exhaustively annotated instances, and mislabeled masks. Due to the prevalence of COCO, we choose to correct these errors to maintain continuity with prior research. We develop COCO-ReM (Refined Masks), a cleaner set of annotations with visibly better mask quality than COCO-2017. We evaluate fifty object detectors and find that models that predict visually sharper masks score higher on COCO-ReM, affirming that they were being incorrectly penalized due to errors in COCO-2017. Moreover, our models trained using COCO-ReM converge faster and score higher than their larger variants trained using COCO-2017, highlighting the importance of data quality in improving object detectors. With these findings, we advocate using COCO-ReM for future object detection research. Our dataset is available at https://cocorem.xyz
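
Because COCO-ReM keeps the COCO annotation format, re-benchmarking an existing detector is essentially a matter of pointing standard pycocotools evaluation at the refined annotation file; the file paths below are placeholders.

    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    coco_gt = COCO("annotations/cocorem_instances_val.json")     # refined masks (placeholder path)
    coco_dt = coco_gt.loadRes("results/detector_val_segm.json")  # your detector's predictions

    evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()   # prints AP/AR; sharper masks should score higher on COCO-ReM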


CLAMP-ViT: Contrastive Data-Free Learning for Adaptive Post-Training Quantization of ViTs
Akshat Ramachandran, Souvik Kundu, Tushar Krishna

We present CLAMP-ViT, a data-free post-training quantization method for vision transformers (ViTs). We identify the limitations of recent data-free quantization techniques, notably their inability to leverage meaningful inter-patch relationships, leading to the generation of simplistic and semantically vague data, impacting quantization accuracy. CLAMP-ViT employs a two-stage approach, cyclically adapting between data generation and model quantization. Specifically, we incorporate a patch-level contrastive learning scheme to generate richer, semantically meaningful data. Furthermore, we leverage contrastive learning in layer-wise evolutionary search for fixed- and mixed-precision quantization to identify optimal quantization parameters while mitigating the effects of a non-smooth loss landscape. Extensive evaluations across various vision tasks demonstrate the superiority of CLAMP-ViT, with performance improvements of up to 3% in top-1 accuracy for classification, 0.6 mAP for object detection, and 1.5 mIoU for segmentation at similar or better compression ratios over existing alternatives.
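
For readers unfamiliar with the term, “patch-level contrastive learning” pulls corresponding patches of two views together and pushes other patches apart. The snippet below is a generic InfoNCE-style loss over patch embeddings, not CLAMP-ViT’s specific objective.

    import torch
    import torch.nn.functional as F

    def patch_infonce(patches_a, patches_b, temperature=0.1):
        """Generic patch-level InfoNCE: corresponding patches of two views are
        positives, every other patch in the batch acts as a negative.

        patches_a, patches_b: (N, D) patch embeddings from two views.
        """
        a = F.normalize(patches_a, dim=-1)
        b = F.normalize(patches_b, dim=-1)
        logits = a @ b.t() / temperature       # (N, N) similarity matrix
        targets = torch.arange(a.size(0))      # diagonal entries are the positives
        return F.cross_entropy(logits, targets)

    loss = patch_infonce(torch.randn(196, 256), torch.randn(196, 256))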


Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation
Siyu Jiao, Hongguang Zhu, Yunchao Wei, Yao Zhao, Jiannan Huang, Humphrey Shi

Pre-trained vision-language models, e.g., CLIP, have been increasingly used to address the challenging Open-Vocabulary Segmentation (OVS) task, benefiting from their well-aligned vision-text embedding space. Typical solutions involve either freezing CLIP during training to unilaterally maintain its zero-shot capability, or fine-tuning CLIP vision encoder to achieve perceptual sensitivity to local regions.  However, few of them incorporate vision-text collaborative optimization. Based on this, we propose the Content-Dependent Transfer to adaptively enhance each text embedding by interacting with the input image, which presents a parameter-efficient way to optimize the text representation. Besides, we additionally introduce a Representation Compensation strategy, reviewing the original CLIP-V representation as compensation to maintain the zero-shot capability of CLIP. In this way, the vision and text representation of CLIP are optimized collaboratively, enhancing the alignment of the vision-text feature space. To the best of our knowledge, we are the first to establish the collaborative vision-text optimizing mechanism within the OVS field. Extensive experiments demonstrate our method achieves superior performance on popular OVS benchmarks. In open-vocabulary semantic segmentation, our method outperforms the previous state-of-the-art approaches by +0.5, +2.3, +3.4, +0.4 and +1.1 mIoU, respectively on A-847, A-150, PC-459, PC-59 and PAS-20. Furthermore, in a panoptic setting on the ADE20K dataset, we achieve the performance of 27.1 PQ, 73.5 SQ, and 32.9 RQ.


Diffusion for Natural Image Matting
Yihan Hu, Yiheng Lin, Wei Wang, Yao Zhao, Yunchao Wei, Humphrey Shi

Existing natural image matting algorithms inevitably have flaws in their predictions on difficult cases, and their one-step prediction manner cannot further correct these errors. In this paper, we investigate a multi-step iterative approach for the first time to tackle the challenging natural image matting task, and achieve excellent performance by introducing a pixel-level denoising diffusion method (DiffMatte) for the alpha matte refinement. To improve iteration efficiency, we design a lightweight diffusion decoder as the only iterative component to directly denoise the alpha matte, saving the huge computational overhead of repeatedly encoding matting features. We also propose an ameliorated self-aligned strategy to consolidate the performance gains brought about by the iterative diffusion process. This allows the model to adapt to various types of errors by aligning the noisy samples used in training and inference, mitigating performance degradation caused by sampling drift. Extensive experimental results demonstrate that DiffMatte not only reaches the state-of-the-art level on the mainstream Composition-1k test set, surpassing the previous best methods by 8% and 15 in the SAD metric and MSE metric respectively, but also shows stronger generalization ability on other benchmarks. The code will be open-sourced for the following research and applications.
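
The appeal of a lightweight diffusion decoder is that only a small module runs at each refinement step while the image features are encoded once. The loop below is a schematic of that pattern with a toy stand-in decoder; it is not DiffMatte’s architecture.

    import torch

    def refine_alpha(decoder, image_features, alpha_init, steps=10):
        """Illustrative iterative refinement: image features are encoded once and
        reused, and only a lightweight decoder runs at every denoising step."""
        alpha = alpha_init
        for t in reversed(range(steps)):
            t_embed = torch.full((alpha.size(0),), t, dtype=torch.float32)
            alpha = decoder(alpha, image_features, t_embed)  # predict a less noisy matte
        return alpha.clamp(0.0, 1.0)

    # Toy stand-in decoder so the sketch runs; a real decoder is a small network.
    toy_decoder = lambda alpha, feats, t: 0.9 * alpha + 0.1 * feats.mean(dim=1, keepdim=True)
    feats = torch.randn(1, 32, 64, 64)
    refined = refine_alpha(toy_decoder, feats, torch.rand(1, 1, 64, 64))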


DεpS: Delayed ε-Shrinking for Faster Once-For-All Training
Aditya Annavajjala, Alind Khare, Animesh Agrawal, Igor Fedorov, Hugo Latapie, Myungjin Lee, Alexey Tumanov

CNNs are increasingly deployed across different hardware, dynamic environments, and low power embedded devices. This has led to the design and training of CNN architectures with the goal of maximizing accuracy subject to such variable deployment constraints. As the number of deployment scenarios grows, there is a need to find scalable solutions to design and train specialized CNNs. Once-for-all training has emerged as a scalable approach that jointly co-trains many models (subnets) at once with a constant training cost and finds specialized CNNs later. The scalability is achieved by training the full model and simultaneously reducing it to smaller subnets that share model weights (weight-sharing). However, existing once-for-all training approaches incur huge training costs reaching 1200 GPU hours. We argue this is because they either start the process of shrinking the full model too early or too late. Hence, we propose DεpS, which starts the process of shrinking the full model when it is partially trained (∼50%), leading to training cost improvement. The proposed approach also consists of novel heuristics that dynamically adjust subnet learning rates incrementally (ε), leading to better weight-shared knowledge distillation from larger to smaller subnets. As a result, DεpS outperforms state-of-the-art once-for-all training techniques across different datasets including CIFAR10/100, ImageNet-100, and ImageNet-1k on accuracy and cost. It achieves 1.83% higher ImageNet-1k top-1 accuracy or the same accuracy with 1.3x reduction in FLOPs and 2.5x drop in training cost (GPU*hrs).
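
Two ingredients are easy to picture in code: a gate that delays shrinking until the full model is roughly half trained, and a learning-rate scale that grows with subnet size. The helpers below are illustrative stand-ins, not the paper’s exact heuristics.

    def should_shrink(step, total_steps, delay_fraction=0.5):
        """Delayed shrinking: train only the full model until roughly half of
        training has elapsed, then begin sampling smaller weight-shared subnets.
        (delay_fraction mirrors the ~50% point described above; illustrative only.)"""
        return step >= delay_fraction * total_steps

    def subnet_lr(base_lr, subnet_width, max_width, epsilon=0.1):
        """Illustrative incremental scaling of a subnet's learning rate with its size."""
        return base_lr * (epsilon + (1 - epsilon) * subnet_width / max_width)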


Efficient Learning of Event-based Dense Representation using Hierarchical Memories with Adaptive Update
Uday Kamal, Saibal Mukhopadhyay

Leveraging the high temporal resolution of an event-based camera requires highly efficient event-by-event processing. However, dense prediction tasks require explicit pixel-level association, which is challenging for event-based processing frameworks. Existing works aggregate the events into a static frame-like representation at the cost of a much slower processing rate and high compute cost. To address this challenge, this work introduces an event-based spatiotemporal representation learning framework for efficiently solving dense prediction tasks. We uniquely handle the sparse, asynchronous events using an unstructured, set-based approach and project them into a hierarchically organized multi-level latent memory space that preserves the pixel-level structure. Low-level event streams are dynamically encoded into these latent structures through an explicit attention-based spatial association. Unlike existing works that update these memory stacks at a fixed rate, we introduce a data-adaptive update rate that recurrently keeps track of the past memory states and learns to update the corresponding memory stacks only when it has substantial new information, thereby improving the overall compute latency. Our method consistently achieves competitive performance across different event-based dense prediction tasks while ensuring much lower latency compared to the existing methods.
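
The key efficiency lever is skipping memory writes when an incoming event encoding carries little new information. The snippet below is a heavily simplified stand-in for that data-adaptive update, with a hand-set novelty threshold in place of the learned update rate.

    import torch

    def maybe_update_memory(memory, new_encoding, threshold=0.1):
        """Simplified data-adaptive update: skip writing to the latent memory when
        the incoming event encoding adds little new information (the novelty score
        and threshold here are illustrative stand-ins for the learned update rate)."""
        novelty = torch.norm(new_encoding - memory) / (torch.norm(memory) + 1e-6)
        if novelty < threshold:
            return memory, False                       # keep the old state, save compute
        return 0.9 * memory + 0.1 * new_encoding, True  # blend in the new information

    memory = torch.zeros(256)
    memory, updated = maybe_update_memory(memory, torch.randn(256))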


I Can’t Believe It’s Not Scene Flow!
Ishan Khatri, Kyle Vedder, Neehar Peri, Deva Ramanan, James Hays

Current scene flow methods broadly fail to describe motion on small objects, and current scene flow evaluation protocols hide this failure by averaging over many points, with most drawn from larger objects. To fix this evaluation failure, we propose a new evaluation protocol, Bucket Normalized EPE, which is class-aware and speed-normalized, enabling contextualized error comparisons between object types that move at vastly different speeds. To highlight current method failures, we propose a frustratingly simple supervised scene flow baseline, TrackFlow, built by bolting a high-quality pretrained detector (trained using many class rebalancing techniques) onto a simple tracker, that produces state-of-the-art performance on current standard evaluations and large improvements over prior art on our new evaluation. Our results make it clear that all scene flow evaluations must be class and speed aware, and supervised scene flow methods must address point class imbalances. We will release the evaluation code publicly upon publication.
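
A simplified version of the proposed evaluation idea can be written in a few lines: compute endpoint error per point, then report it separately per class and per speed bucket instead of averaging everything together. The bucket edges below are illustrative, not the benchmark’s official thresholds.

    import numpy as np

    def bucketed_epe(pred_flow, gt_flow, classes, speeds,
                     buckets=((0.0, 0.5), (0.5, 2.0), (2.0, np.inf))):
        """Simplified class-aware, speed-bucketed endpoint error.

        pred_flow, gt_flow: (N, 3) per-point flow vectors; classes: (N,) labels;
        speeds: (N,) ground-truth point speeds used to assign buckets.
        """
        epe = np.linalg.norm(pred_flow - gt_flow, axis=1)
        results = {}
        for cls in np.unique(classes):
            for lo, hi in buckets:
                mask = (classes == cls) & (speeds >= lo) & (speeds < hi)
                if mask.any():
                    results[(int(cls), (lo, hi))] = float(epe[mask].mean())
        return results   # report per class and speed bucket instead of one global mean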


LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
Bolin Lai, Xiaoliang Dai, Lawrence Chen, Guan Pang, James Rehg, Miao Liu

Generating instructional images of human daily actions from an egocentric viewpoint serves as a key step towards efficient skill transfer. In this paper, we introduce a novel problem — egocentric action frame generation. The goal is to synthesize the action frame conditioning on the user prompt question and an input egocentric image that captures the user’s environment. Notably, existing egocentric action datasets lack the detailed annotations that describe the execution of actions. Additionally, the existing diffusion-based image manipulation models are sub-optimal in controlling the state transition of an action in egocentric image pixel space because of the domain gap. To this end, we propose to Learn EGOcentric (LEGO) action frame generation via visual instruction tuning. First, we introduce a prompt enhancement scheme to generate enriched action descriptions from a visual large language model (VLLM) by visual instruction tuning. Then we propose a novel method to leverage image and text embeddings from VLLM as additional conditioning to improve the performance of a diffusion model. We validate our model on two egocentric datasets — Ego4D and Epic-Kitchens. Our experiments show prominent improvement over prior image manipulation models in both quantitative and qualitative evaluation. We also conduct detailed ablation studies and analysis to provide insights into our method.


Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation
Bolin Lai, Fiona Ryan, Wenqi Jia, Miao Liu, James Rehg

Egocentric gaze anticipation serves as a key building block for the emerging capability of Augmented Reality. Notably, gaze behavior is driven by both visual cues and audio signals during daily activities. Motivated by this observation, we introduce the first model that leverages both the video and audio modalities for egocentric gaze anticipation. Specifically, we propose a Contrastive Spatial-Temporal Separable (CSTS) fusion approach that adopts two modules to separately capture audio-visual correlations in spatial and temporal dimensions, and applies a contrastive loss on the re-weighted audio-visual features from fusion modules for representation learning. We conduct extensive ablation studies and thorough analysis using two egocentric video datasets: Ego4D and Aria, to validate our model design. We demonstrate that audio improves performance by +2.5% and +2.4% on the two datasets. Our model also outperforms the prior state-of-the-art methods by at least +1.9% and +1.6%. Moreover, we provide visualizations to show the gaze anticipation results and provide additional insights into audio-visual representation learning.


NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields
Muhammad Irshad, Sergey Zakharov, Vitor Guizilini, Adrien Gaidon, Zsolt Kira, Rares Ambrus

Neural fields excel in computer vision and robotics due to their ability to understand the 3D visual world such as inferring semantics, geometry, and dynamics. Given the capabilities of neural fields in densely representing a 3D scene from 2D images, we ask the question: Can we scale their self-supervised pretraining, specifically using masked autoencoders, to generate effective 3D representations from posed RGB images? Owing to the astounding success of extending transformers to novel data modalities, we employ standard 3D Vision Transformers to suit the unique formulation of NeRFs. We leverage NeRF’s volumetric grid as a dense input to the transformer, contrasting it with other 3D representations such as pointclouds where the information density can be uneven, and the representation is irregular. Due to the difficulty of applying masked autoencoders to an implicit representation, such as NeRF, we opt for extracting an explicit representation that canonicalizes scenes across domains by employing the camera trajectory for sampling. Our goal is made possible by masking random patches from NeRF’s radiance and density grid and employing a standard 3D Swin Transformer to reconstruct the masked patches. In doing so, the model can learn the semantic and spatial structure of complete scenes. We pretrain this representation at scale on our proposed curated posed-RGB data, totaling over 1.8 million images. Once pretrained, the encoder is used for effective 3D transfer learning. Our novel self-supervised pretraining for NeRFs, NeRF-MAE, scales remarkably well and improves performance on various challenging 3D tasks. Utilizing unlabeled posed 2D data for pretraining, NeRF-MAE significantly outperforms self-supervised 3D pretraining and NeRF scene understanding baselines on Front3D and ScanNet datasets with an absolute performance improvement of over 20% AP50 and 8% AP25 for 3D object detection. Project Page: https://nerf-mae.github.io
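
The pretraining setup boils down to hiding cubic patches of an explicit radiance-and-density grid and asking a 3D transformer to fill them back in. The snippet below sketches only the random 3D patch masking step, with a generic patch size and masking ratio.

    import torch

    def mask_3d_patches(grid, patch=4, mask_ratio=0.75):
        """Illustrative 3D masked-autoencoding setup: split a (C, D, H, W) radiance/
        density grid into cubic patches and zero out a random subset for the model
        to reconstruct (patch size and ratio are generic choices)."""
        C, D, H, W = grid.shape
        d, h, w = D // patch, H // patch, W // patch
        keep = torch.rand(d * h * w) > mask_ratio            # True = patch stays visible
        mask = keep.view(d, h, w)
        mask = mask.repeat_interleave(patch, 0).repeat_interleave(patch, 1).repeat_interleave(patch, 2)
        return grid * mask, ~mask                            # masked grid + mask of hidden voxels

    masked_grid, hidden = mask_3d_patches(torch.rand(4, 32, 32, 32))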


OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects
Akshay Krishnan, Abhijit Kundu, Kevis-Kokitsi Maninis, James Hays, Matthew Brown

We propose OmniNOCS: a large-scale monocular dataset with 3D Normalized Object Coordinate Space (NOCS), object masks, and 3D bounding box annotations for indoor and outdoor scenes. OmniNOCS has 20 times more object classes and 200 times more instances than existing NOCS datasets (NOCS-Real275, Wild6D). We use OmniNOCS to train a novel, transformer-based monocular NOCS prediction model (NOCSformer) that can predict accurate NOCS, instance masks and poses from 2D object detections across diverse classes. It is the first NOCS model that can generalize to a broad range of classes when prompted with 2D boxes. We evaluate our model on the task of 3D oriented bounding box prediction, where it achieves comparable results to state-of-the-art 3D detection methods such as Cube R-CNN. Unlike other 3D detection methods, our model also provides detailed and accurate 3D object shape and segmentation. We propose a novel benchmark for the task of NOCS prediction based on OmniNOCS, which we hope will serve as a useful baseline for future work in this area. Our dataset and code are at the project website: https://omninocs.github.io


ORAL Omni-Recon: Harnessing Image-based Rendering for General-Purpose Neural Radiance Fields
Yonggan Fu, Huaizhi Qu, Zhifan Ye, Chaojian Li, Kevin Zhao, Yingyan “Celine” Lin

Recent breakthroughs in Neural Radiance Fields (NeRFs) have sparked significant demand for their integration into real-world 3D applications. However, the varied functionalities required by different 3D applications often necessitate diverse NeRF models with various pipelines, leading to tedious NeRF training for each target task and cumbersome trial-and-error experiments. Drawing inspiration from the generalization capability and adaptability of emerging foundation models, our work aims to develop one general-purpose NeRF for handling diverse 3D tasks. We achieve this by proposing a framework called Omni-Recon, which is capable of (1) generalizable 3D reconstruction and zero-shot multitask scene understanding, and (2) adaptability to diverse downstream 3D applications such as real-time rendering and scene editing. Our key insight is that an image-based rendering pipeline, with accurate geometry and appearance estimation, can lift 2D image features into their 3D counterparts, thus extending widely explored 2D tasks to the 3D world in a generalizable manner. Specifically, our Omni-Recon features a general-purpose NeRF model using image-based rendering with two decoupled branches: one complex transformer-based branch that progressively fuses geometry and appearance features for accurate geometry estimation, and one lightweight branch for predicting blending weights of source views. This design achieves state-of-the-art (SOTA) generalizable 3D surface reconstruction quality with blending weights reusable across diverse tasks for zero-shot multitask scene understanding. In addition, it can enable real-time rendering after baking the complex geometry branch into meshes, swift adaptation to achieve SOTA generalizable 3D understanding performance, and seamless integration with 2D diffusion models for text-guided 3D editing. All code will be released upon acceptance.
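
The image-based rendering step at the heart of the pipeline can be pictured as a weighted blend of colors (or features) gathered from nearby source views, with the weights coming from the lightweight branch. The sketch below shows that blending in isolation, with made-up tensor shapes.

    import torch

    def blend_source_views(source_colors, blending_logits):
        """Illustrative image-based rendering step: a target ray's color is a
        weighted combination of colors sampled from nearby source views.

        source_colors: (R, V, 3) colors gathered from V source views per ray.
        blending_logits: (R, V) scores from a lightweight weight-prediction branch.
        """
        weights = torch.softmax(blending_logits, dim=-1)          # per-ray view weights
        return (weights.unsqueeze(-1) * source_colors).sum(dim=1)

    rendered = blend_source_views(torch.rand(1024, 8, 3), torch.randn(1024, 8))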


ORAL Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation
Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Junfeng He, Gang Li, Sangpil Kim, Irfan Essa, Feng Yang

Recent reinforcement learning (RL) works have demonstrated that using multiple quality rewards can improve the quality of generated images in text-to-image (T2I) generation. However, manually adjusting reward weights poses challenges and may cause over-optimization in certain metrics. To solve this, we propose Parrot, which addresses the issue through multi-objective optimization and introduces an effective multi-reward optimization strategy to approximate the Pareto-optimal set. Utilizing batch-wise Pareto optimal selection, Parrot automatically identifies the optimal trade-off among different rewards. We use the novel multi-reward optimization algorithm to jointly optimize the T2I model and a prompt expansion network, resulting in significant improvements in image quality while also allowing the trade-off between rewards to be controlled through a reward-related prompt at inference. Furthermore, we introduce original prompt-centered guidance at inference time, ensuring fidelity to user input after prompt expansion. Extensive experiments and a user study validate the superiority of Parrot over several baselines across various quality criteria, including aesthetics, human preference, text-image alignment, and image sentiment.
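
Batch-wise Pareto-optimal selection means keeping only the generated samples that no other sample in the batch beats on every reward at once. The sketch below is a straightforward non-dominated filter over reward vectors, shown for intuition rather than as Parrot’s implementation.

    import numpy as np

    def pareto_front(rewards):
        """Return indices of non-dominated samples in a batch of reward vectors.

        rewards: (N, K) array, one row of K quality rewards per generated image.
        A sample is dominated if another sample is >= on every reward and > on at least one.
        """
        n = rewards.shape[0]
        keep = []
        for i in range(n):
            others = np.delete(rewards, i, axis=0)
            dominated = np.any(np.all(others >= rewards[i], axis=1) &
                               np.any(others > rewards[i], axis=1))
            if not dominated:
                keep.append(i)
        return keep

    batch_rewards = np.random.rand(16, 4)   # e.g., aesthetics, preference, alignment, sentiment
    selected = pareto_front(batch_rewards)  # only these samples drive the policy update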


Personalized Privacy Protection Mask Against Unauthorized Facial Recognition
Ka-Ho Chow, Sihao Hu, Tiansheng Huang, Ling Liu

Face recognition (FR) can be misused for privacy intrusion. Governments, private companies, or even individual attackers can collect facial images by web scraping to build an FR system identifying human faces without their consent. This paper introduces Chameleon, which learns to generate a user-centric personalized privacy protection mask, coined as P3-Mask, to protect facial images against unauthorized FR with three salient features. First, we use a cross-image optimization to generate one P3-Mask for each user instead of tailoring facial perturbation for each facial image of a user. It enables efficient and instant protection even for users with limited computing resources. Second, we incorporate a perceptibility optimization to preserve the visual quality of the protected facial images. Third, we strengthen the robustness of P3-Mask against unknown FR models by integrating focal diversity-optimized ensemble learning into the mask generation process. Extensive experiments on two benchmark datasets show that Chameleon outperforms three state-of-the-art methods with instant protection and minimal degradation of image quality. Furthermore, Chameleon enables cost-effective FR authorization using the P3-Mask as a personalized de-obfuscation key, and it demonstrates high resilience against adaptive adversaries.


Photorealistic Video Generation with Diffusion Models
Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama

We present W.A.L.T, a diffusion transformer for photorealistic video generation from text prompts. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation consisting of a base latent video diffusion model, and two video super-resolution diffusion models to generate videos of 512 x 896 resolution at 8 frames per second.


Reinforcement Learning via Auxiliary Task Distillation
Abhinav Harish, Larry Heck, Josiah Hanna, Zsolt Kira, Andrew Szot

We present Reinforcement Learning via Auxiliary Task Distillation (AuxDistill), a new method for leveraging reinforcement learning (RL) in long-horizon robotic control problems by distilling behaviors from auxiliary RL tasks. AuxDistill trains pixels-to-actions policies end-to-end with RL, without demonstrations, a learning curriculum, or pre-trained skills. AuxDistill achieves this by concurrently doing multi-task RL in auxiliary tasks which are easier than and relevant to the main task. Behaviors learned in the auxiliary tasks are transferred to solving the main task through a weighted distillation loss. In an embodied object-rearrangement task, we show AuxDistill achieves a 27% higher success rate than baselines.
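
A weighted distillation loss of this kind can be pictured as pulling the main-task policy toward each auxiliary policy’s action distribution, scaled by a relevance weight. The snippet below is a generic version of that idea; the weighting scheme is illustrative, not the paper’s.

    import torch
    import torch.nn.functional as F

    def weighted_distill_loss(main_logits, aux_logits_list, weights):
        """Generic weighted distillation: the main-task policy is pulled toward the
        action distributions of auxiliary-task policies, each scaled by a relevance
        weight (the weighting scheme here is an illustrative stand-in)."""
        log_p_main = F.log_softmax(main_logits, dim=-1)
        loss = main_logits.new_zeros(())
        for w, aux_logits in zip(weights, aux_logits_list):
            p_aux = F.softmax(aux_logits, dim=-1)
            loss = loss + w * F.kl_div(log_p_main, p_aux, reduction="batchmean")
        return loss

    loss = weighted_distill_loss(torch.randn(32, 10),
                                 [torch.randn(32, 10), torch.randn(32, 10)],
                                 weights=[0.7, 0.3])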


SkyScenes: A Synthetic Dataset for Aerial Scene Understanding
Sahil Khose, Anisha Pal, Aayushi Agarwal, Deepanshi, Judy Hoffman, Prithvijit Chattopadhyay

Real-world aerial scene understanding is limited by a lack of datasets that contain densely annotated images curated under a diverse set of conditions. Due to inherent challenges in obtaining such images in controlled real-world settings, we present SkyScenes, a synthetic dataset of densely annotated aerial images captured from Unmanned Aerial Vehicle (UAV) perspectives. We carefully curate SkyScenes images from CARLA to comprehensively capture diversity across layouts (urban and rural maps), weather conditions, times of day, pitch angles and altitudes with corresponding semantic, instance and depth annotations. Through our experiments using SkyScenes, we show that (1) models trained on SkyScenes generalize well to different real-world scenarios, (2) augmenting training on real images with SkyScenes data can improve real-world performance, (3) controlled variations in SkyScenes can offer insights into how models respond to changes in viewpoint conditions, and (4) incorporating additional sensor modalities (depth) can improve aerial scene understanding.


SuperFedNAS: Cost-Efficient Federated Neural Architecture Search for On-Device Inference
Alind Khare, Animesh Agrawal, Aditya Annavajjala, Payman Behnam, Myungjin Lee, Hugo Latapie, Alexey Tumanov

Neural Architecture Search (NAS) for Federated Learning (FL) is an emerging field. It automates the design and training of Deep Neural Networks (DNNs) when data cannot be centralized due to privacy, communication costs, and regulatory restrictions. Recent federated NAS methods not only reduce manual effort but also provide more accuracy than traditional FL methods like FedAvg. Despite the success, existing federated NAS methods fail to satisfy diverse deployment targets common in on-device inference like hardware, latency budgets, or variable battery. Most federated NAS methods search for only a limited range of architectural patterns, repeat the same pattern in DNNs and thereby harm performance. Moreover, these methods incur prohibitive training costs to satisfy deployment targets. They perform the training and search of DNN architectures repeatedly for each case. We propose SuperFedNAS to address these challenges. It decouples the training and search in federated NAS. SuperFedNAS co-trains a large number of diverse DNN architectures contained inside one supernet in the FL setting. Post-training, clients perform NAS locally to find specialized DNNs by extracting different parts of the trained supernet with no additional training. SuperFedNAS takes O(1) (instead of O(N)) cost to find specialized DNN architectures in FL for any N deployment targets. As part of SuperFedNAS, we introduce MaxNet, a novel FL training algorithm that performs multi-objective federated optimization of a large number of DNN architectures (≈ 5 × 10^18) under different client data distributions. Overall, SuperFedNAS achieves up to 37.7% higher accuracy for the same MACs or up to 8.13x reduction in MACs for the same accuracy compared to existing federated NAS methods.
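
The O(1) claim comes from the fact that, once the supernet is trained, each client only searches locally for a subnet that fits its own deployment target. The sketch below illustrates that post-training search loop; the configuration space, latency estimator, and accuracy evaluator are all placeholders a client would supply.

    import random

    def local_subnet_search(sample_subnet, estimate_latency, evaluate_accuracy,
                            latency_budget_ms, trials=200):
        """Illustrative post-training search on a client: sample subnet configurations
        from the trained supernet and keep the most accurate one under the device's
        latency budget. No additional training is needed; weights are extracted.
        (All callables here are placeholders for the client's own implementations.)"""
        best_cfg, best_acc = None, -1.0
        for _ in range(trials):
            cfg = sample_subnet()                        # e.g., depths/widths per stage
            if estimate_latency(cfg) > latency_budget_ms:
                continue
            acc = evaluate_accuracy(cfg)                 # evaluate extracted weights locally
            if acc > best_acc:
                best_cfg, best_acc = cfg, acc
        return best_cfg, best_acc

    # Toy stand-ins so the sketch runs end to end.
    cfg_space = [{"depth": d, "width": w} for d in (2, 3, 4) for w in (32, 64, 96)]
    best = local_subnet_search(lambda: random.choice(cfg_space),
                               lambda c: c["depth"] * c["width"] * 0.05,
                               lambda c: random.random(),
                               latency_budget_ms=10.0)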


UniIR: Training and Benchmarking Universal Multimodal Information Retrievers
Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, Wenhu Chen

Existing information retrieval (IR) models often assume a homogeneous format, limiting their applicability to diverse user needs, such as searching for images with text descriptions, searching for a news article with a headline image, or finding a similar photo with a query image. To approach such different information-seeking demands, we introduce UniIR, a unified instruction-guided multimodal retriever capable of handling eight distinct retrieval tasks across modalities. UniIR, a single retrieval system jointly trained on ten diverse multimodal-IR datasets, interprets user instructions to execute various retrieval tasks, demonstrating robust performance across existing datasets and zero-shot generalization to new tasks. Our experiments highlight that multi-task training and instruction tuning are keys to UniIR’s generalization ability. Additionally, we construct the M-BEIR, a multimodal retrieval benchmark with comprehensive results, to standardize the evaluation of universal multimodal information retrieval.
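
Instruction-guided retrieval can be pictured as fusing the task instruction with the query before embedding it and scoring it against a heterogeneous candidate pool. The sketch below uses random stand-in encoders purely to make the flow concrete; it is not UniIR’s model.

    import torch
    import torch.nn.functional as F

    def retrieve(encode_query, encode_candidate, instruction, query, candidates, k=5):
        """Illustrative instruction-guided retrieval: the task instruction is fused
        with the (possibly multimodal) query before scoring a heterogeneous pool.
        Both encoders are placeholders for a jointly trained multimodal model."""
        q = F.normalize(encode_query(instruction, query), dim=-1)                         # (D,)
        c = F.normalize(torch.stack([encode_candidate(x) for x in candidates]), dim=-1)   # (N, D)
        scores = c @ q
        return scores.topk(min(k, len(candidates))).indices.tolist()

    # Toy stand-ins: random embeddings just to make the sketch executable.
    dim = 128
    retrieved = retrieve(lambda ins, qr: torch.randn(dim),
                         lambda cand: torch.randn(dim),
                         "Retrieve a news image matching this headline.",
                         "Flooding hits coastal towns",
                         ["img_001", "img_002", "img_003"])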

See you in Milan!

Development: College of Computing
Project Lead/Data Graphics: Joshua Preston
Feature Photos: Kevin Beasley, Terence Rushin
Data Management: Joni Isbell