The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) is the most influential annual conference in the field of artificial intelligence. According to Google Scholar metrics, CVPR ranks fourth among all academic publications worldwide and first in computer science and artificial intelligence. CVPR 2023 received 9,155 submissions and accepted 2,360, for an acceptance rate of 25.78%. Eleven papers from the Fudan Vision and Learning (FVL) Laboratory have been accepted, covering important research directions such as video understanding, object detection and segmentation, scalable deep neural networks, adversarial attacks and defense, and digital pathology image analysis.
By exploring dense matching between the current frame and past frames for long-range context modeling, memory-based methods have recently demonstrated impressive results in video object segmentation (VOS). Nevertheless, due to their lack of instance-level understanding, these approaches are often brittle to large appearance variations or viewpoint changes resulting from the movement of objects and cameras.
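As a rough illustration of the dense-matching mechanism referred to above, the sketch below performs an attention-style memory read in PyTorch: every location of the current frame is matched against all locations of the stored past frames. The tensor names, shapes, and softmax normalization are illustrative assumptions, not the exact design of any particular method.

```python
# Minimal sketch of a memory read for memory-based VOS; shapes are assumptions.
import torch
import torch.nn.functional as F

def memory_read(query_key, memory_key, memory_value):
    """Densely match the current frame against stored past frames.

    query_key:    (B, C_k, H*W)      key features of the current frame
    memory_key:   (B, C_k, T*H*W)    key features of T past frames
    memory_value: (B, C_v, T*H*W)    value features (mask-aware) of past frames
    """
    # Pairwise similarity between every query location and every memory location.
    affinity = torch.einsum("bck,bcm->bkm", query_key, memory_key)      # (B, H*W, T*H*W)
    affinity = F.softmax(affinity / query_key.shape[1] ** 0.5, dim=-1)  # normalize over memory
    # Aggregate memory values for each query location.
    readout = torch.einsum("bkm,bcm->bck", affinity, memory_value)      # (B, C_v, H*W)
    return readout

# Toy usage: one query frame matched against 3 memory frames.
B, Ck, Cv, HW, T = 1, 64, 128, 24 * 24, 3
out = memory_read(torch.randn(B, Ck, HW), torch.randn(B, Ck, T * HW), torch.randn(B, Cv, T * HW))
print(out.shape)  # torch.Size([1, 128, 576])
```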
Benefiting from masked visual modeling, self-supervised video representation learning has achieved remarkable progress. Existing methods (e.g., BEVT and VideoMAE) focus on learning representations from scratch by reconstructing low-level targets such as raw RGB pixel values or low-level VQ-VAE tokens. However, using low-level features as reconstruction targets often introduces considerable noise. Moreover, due to the high redundancy in video data, masked video modeling easily learns shortcuts, resulting in limited transfer performance on downstream tasks.
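To make the masked-reconstruction objective concrete, the following PyTorch sketch masks a large fraction of patch tokens and computes the reconstruction loss only on the masked positions. The tiny encoder, the reconstruction target, and the 90% masking ratio are illustrative assumptions rather than the configuration of any specific method.

```python
# Minimal sketch of masked video modeling; encoder, target, and ratio are assumptions.
import torch
import torch.nn as nn

class TinyMaskedVideoModel(nn.Module):
    def __init__(self, dim=256, target_dim=256):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=2)
        self.decoder = nn.Linear(dim, target_dim)   # predicts the reconstruction target
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, tokens, mask):
        """tokens: (B, N, dim) patch embeddings; mask: (B, N) bool, True = masked."""
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        x = self.encoder(x)
        return self.decoder(x)

def masked_reconstruction_loss(pred, target, mask):
    """Compute the loss only on masked tokens, as in masked visual modeling."""
    loss = (pred - target).pow(2).mean(dim=-1)       # (B, N) per-token error
    return (loss * mask).sum() / mask.sum().clamp(min=1)

B, N, D = 2, 196, 256
tokens, target = torch.randn(B, N, D), torch.randn(B, N, D)
mask = torch.rand(B, N) < 0.9                        # high masking ratio, common for video
model = TinyMaskedVideoModel(dim=D, target_dim=D)
loss = masked_reconstruction_loss(model(tokens, mask), target, mask)
```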
Semi-supervised action recognition is a challenging but critical task due to the high cost of video annotations. Existing approaches mainly use convolutional neural networks, while the more recent vision transformer models have been less explored. In this paper, we investigate the use of transformer models for action recognition under the semi-supervised learning (SSL) setting. To this end, we introduce SVFormer, which adopts a steady pseudo-labeling framework (EMA-Teacher) to cope with unlabeled video samples.
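A rough sketch of the EMA-Teacher pseudo-labeling scheme mentioned above is given below: the teacher is an exponential moving average of the student and assigns pseudo-labels to unlabeled clips, which the student then learns from. The confidence threshold, momentum, augmentations, and the toy "video model" are illustrative assumptions.

```python
# Minimal sketch of EMA-Teacher pseudo-labeling; hyperparameters are assumptions.
import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, momentum=0.999):
    """Update teacher weights as an exponential moving average of the student."""
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

def pseudo_label_loss(student, teacher, weak_clip, strong_clip, threshold=0.9):
    """Teacher labels a weakly augmented clip; student learns on a strong view."""
    with torch.no_grad():
        probs = F.softmax(teacher(weak_clip), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        keep = conf >= threshold                       # only trust confident pseudo-labels
    logits = student(strong_clip)
    loss = F.cross_entropy(logits, pseudo, reduction="none")
    return (loss * keep).sum() / keep.sum().clamp(min=1)

# Toy usage with a linear "video model" over flattened clip features.
student = torch.nn.Linear(512, 10)
teacher = copy.deepcopy(student)
weak, strong = torch.randn(4, 512), torch.randn(4, 512)
loss = pseudo_label_loss(student, teacher, weak, strong)
loss.backward()
ema_update(teacher, student)
```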
Accurate and reliable 3D object detection in autonomous driving systems relies on the fusion of LiDAR and camera information. However, the challenge lies in efficiently combining multi-granularity geometric and semantic features from these two modalities, which possess drastically different characteristics. Existing approaches aim to enhance the semantic density of camera features by lifting points in 2D camera images (referred to as seeds) into 3D space, followed by the fusion of 2D semantics through cross-modal interaction techniques. However, depth information is underexplored in these methods when lifting points into 3D space, resulting in unreliable fusion of 2D semantics with 3D points. Moreover, their multi-modal fusion strategy lacks fine-grained interactions in the voxel space.
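The lifting step described above can be illustrated with a small unprojection routine: given pixel coordinates, depth estimates, and the camera intrinsics, 2D seeds are back-projected into 3D camera coordinates. The intrinsic matrix and depth values below are illustrative assumptions; the sketch also makes clear why inaccurate depth directly corrupts the lifted points and the subsequent fusion.

```python
# Minimal sketch of lifting 2D seeds into 3D with depth; K and depths are assumptions.
import torch

def lift_seeds_to_3d(uv, depth, intrinsics):
    """Unproject pixel coordinates into 3D camera coordinates.

    uv:         (N, 2) pixel coordinates (u, v)
    depth:      (N,)   depth of each seed along the camera z-axis
    intrinsics: (3, 3) camera intrinsic matrix K
    """
    ones = torch.ones(uv.shape[0], 1)
    pixels_h = torch.cat([uv, ones], dim=1)              # homogeneous pixel coords (N, 3)
    rays = pixels_h @ torch.linalg.inv(intrinsics).T     # back-project to camera rays
    return rays * depth.unsqueeze(1)                     # scale rays by depth -> (N, 3) points

K = torch.tensor([[1000.0, 0.0, 640.0],
                  [0.0, 1000.0, 360.0],
                  [0.0, 0.0, 1.0]])
uv = torch.tensor([[640.0, 360.0], [700.0, 400.0]])
depth = torch.tensor([10.0, 25.0])
points_3d = lift_seeds_to_3d(uv, depth, K)               # unreliable if depth is inaccurate
```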
Combining multiple datasets boosts performance on many computer vision tasks, but a similar trend has not been observed in object detection, due to two inconsistencies among detection datasets: taxonomy differences and domain gaps. In this paper, we address these challenges with a new design (named Detection Hub) that is dataset-aware and category-aligned. It not only mitigates the dataset inconsistency but also provides coherent guidance for the detector to learn across multiple datasets. In particular, the dataset-aware design is achieved by learning a dataset embedding that is used to adapt object queries as well as convolutional kernels in the detection heads. Categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embeddings, leveraging the semantic coherence of language embeddings. Detection Hub thus realizes the benefits of large-scale data for object detection. Experiments demonstrate that joint training on multiple datasets achieves significant performance gains over training on each dataset alone, and Detection Hub further achieves state-of-the-art performance on the UODB benchmark, which spans a wide variety of datasets.
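To illustrate the two ideas, the sketch below (a loose interpretation, not the paper's implementation) adapts a set of shared object queries with a learned dataset embedding and classifies region features by similarity to category word embeddings in a unified semantic space. Dimensions and the source of the word embeddings are assumptions.

```python
# Minimal sketch of dataset-aware queries and word-embedding classification; dims are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DatasetAwareHead(nn.Module):
    def __init__(self, num_datasets, num_queries=100, dim=256):
        super().__init__()
        self.dataset_embed = nn.Embedding(num_datasets, dim)    # one embedding per dataset
        self.object_queries = nn.Parameter(torch.randn(num_queries, dim))
        self.adapt = nn.Linear(dim, dim)

    def forward(self, region_features, dataset_id, category_word_embeds):
        """region_features: (B, Q, dim); category_word_embeds: (C, dim) language embeddings."""
        # Dataset-aware: modulate shared object queries with the dataset embedding.
        d = self.dataset_embed(dataset_id)                      # (B, dim)
        queries = self.object_queries.unsqueeze(0) + self.adapt(d).unsqueeze(1)
        # Category-aligned: classify by similarity to word embeddings in a unified space.
        feats = F.normalize(region_features + queries, dim=-1)
        cats = F.normalize(category_word_embeds, dim=-1)
        logits = feats @ cats.T                                 # (B, Q, C) similarity logits
        return logits

head = DatasetAwareHead(num_datasets=3)
logits = head(torch.randn(2, 100, 256), torch.tensor([0, 2]), torch.randn(80, 256))
```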
Anomaly detection and localization are widely used in industrial manufacturing for their efficiency and effectiveness. Anomalies are rare and hard to collect, and supervised models trained on a handful of abnormal samples easily overfit to the seen anomalies, producing unsatisfactory performance. On the other hand, anomalies are typically subtle, hard to discern, and varied in appearance, making it difficult to detect anomalies, let alone localize anomalous regions.
Cross-Domain Few-Shot Learning (CD-FSL) is a recently emerging task that tackles few-shot learning across different domains. It aims at transferring prior knowledge learned on a source dataset to novel target datasets. CD-FSL is especially challenged by the huge domain gap between datasets. Critically, such a domain gap largely comes from changes in visual style, and our prior work wave-SAN empirically shows that spanning the style distribution of the source data helps alleviate this issue. However, wave-SAN simply swaps the styles of two images. Such a vanilla operation makes the generated styles "real" and "easy", still falling within the original set of source styles. Thus, inspired by vanilla adversarial learning, we propose a novel model-agnostic meta Style Adversarial training (StyleAdv) method for CD-FSL, together with a novel style adversarial attack method. In particular, our style attack synthesizes both "virtual" and "hard" adversarial styles for model training.
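As a loose illustration of the difference between swapping and attacking styles, the sketch below treats channel-wise feature statistics (mean, std) as the style: swapping reproduces a wave-SAN-like operation, while a signed-gradient step on the statistics yields perturbed, harder styles in the spirit of StyleAdv. The step size and the toy loss are illustrative assumptions, not the paper's attack.

```python
# Minimal sketch: style swap vs. adversarial style perturbation; step/loss are assumptions.
import torch

def feature_style(feat, eps=1e-6):
    """Channel-wise mean/std of a feature map (B, C, H, W), a common proxy for style."""
    mu = feat.mean(dim=(2, 3), keepdim=True)
    sigma = feat.std(dim=(2, 3), keepdim=True) + eps
    return mu, sigma

def restyle(feat, mu_new, sigma_new):
    """Re-normalize features to a new style (AdaIN-like)."""
    mu, sigma = feature_style(feat)
    return (feat - mu) / sigma * sigma_new + mu_new

def swap_styles(feat_a, feat_b):
    """Vanilla swap: give each image the other's style ("real", "easy" styles)."""
    mu_a, sig_a = feature_style(feat_a)
    mu_b, sig_b = feature_style(feat_b)
    return restyle(feat_a, mu_b, sig_b), restyle(feat_b, mu_a, sig_a)

def adversarial_style_step(feat, loss_fn, step=0.05):
    """One signed-gradient step on the style statistics ("virtual", "hard" styles)."""
    mu, sigma = feature_style(feat)
    mu, sigma = mu.detach().requires_grad_(True), sigma.detach().requires_grad_(True)
    loss_fn(restyle(feat, mu, sigma)).backward()
    new_mu = (mu + step * mu.grad.sign()).detach()
    new_sigma = (sigma + step * sigma.grad.sign()).detach()
    return restyle(feat, new_mu, new_sigma)

feat_a, feat_b = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
swapped_a, swapped_b = swap_styles(feat_a, feat_b)
hard_a = adversarial_style_step(feat_a, loss_fn=lambda f: f.pow(2).mean())
```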
Vision Transformers (ViTs) have achieved overwhelming success. Low-resolution inputs make ViTs efficient, while high-resolution inputs are necessary for high performance as well as dense downstream tasks. Nevertheless, ViTs suffer from poor resolution scalability, i.e., their performance drops drastically when the input resolution is much higher or lower than the training resolution.
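One way to see the scalability issue is that changing the input resolution changes the number of patch tokens, so the learned position embeddings no longer fit and are commonly interpolated at test time, as sketched below. This is a generic workaround shown only to illustrate the problem; it is not the remedy proposed in the paper.

```python
# Minimal sketch of position-embedding interpolation when the test resolution changes.
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid, new_grid):
    """pos_embed: (1, old_grid*old_grid, dim) learned at the training resolution."""
    dim = pos_embed.shape[-1]
    grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

# Trained at 224x224 with 16x16 patches -> 14x14 tokens; tested at 384x384 -> 24x24 tokens.
pos_embed_224 = torch.randn(1, 14 * 14, 768)
pos_embed_384 = interpolate_pos_embed(pos_embed_224, old_grid=14, new_grid=24)
print(pos_embed_384.shape)  # torch.Size([1, 576, 768])
```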
Recent works have demonstrated that adversarial examples have the property of transferability: an adversarial example generated on one white-box model can be used to fool other black-box models. Transferability makes black-box attacks convenient to mount, raising security concerns for deploying deep models in real-world applications. Consequently, considerable research attention has been devoted to improving the transferability of adversarial examples for both non-targeted and targeted attacks. Compared to non-targeted attacks, transfer-based targeted attacks are inherently much more challenging, since the goal is to fool deep models into predicting a specific target class.
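For concreteness, the sketch below crafts a targeted adversarial example on a white-box surrogate with an iterative FGSM-style attack; the perturbed image would then be fed to a different, black-box model to test transfer. The surrogate model, step size, and epsilon are illustrative assumptions.

```python
# Minimal sketch of a transfer-based targeted attack; surrogate and budget are assumptions.
import torch
import torch.nn.functional as F

def targeted_attack(surrogate, image, target_class, eps=8 / 255, steps=10):
    """Iterative signed-gradient targeted attack within an L_inf ball of radius eps."""
    adv = image.clone().detach()
    alpha = eps / steps
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(surrogate(adv), target_class)   # loss of the desired target class
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() - alpha * grad.sign()                # step toward the target class
        adv = image + (adv - image).clamp(-eps, eps)            # project back into the L_inf ball
        adv = adv.clamp(0, 1)
    return adv.detach()

# Toy usage with a linear surrogate over 3x32x32 inputs.
surrogate = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
image = torch.rand(1, 3, 32, 32)
adv_image = targeted_attack(surrogate, image, target_class=torch.tensor([3]))
# adv_image would then be fed to a different (black-box) model to test transfer.
```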
There is a growing interest in developing unlearnable examples (UEs) against visual privacy leaks on the Internet. UEs are training samples perturbed with invisible but unlearnable noise, which has been shown to prevent unauthorized training of machine learning models. However, existing UE generation methods all rely on an ideal assumption called label-consistency, where the hackers and protectors are assumed to hold the same label for a given sample. In this work, we propose and promote a more practical label-agnostic setting, where the hackers may exploit the protected data quite differently from the protectors. Existing UE generation methods are rendered ineffective in this challenging setting.
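As a simplified illustration of the label-consistent setting that existing UE methods assume, the sketch below optimizes a bounded "error-minimizing" noise so that the protected samples appear already learned and carry little useful training signal. The model, noise bound, and number of steps are illustrative assumptions, not a specific method's recipe.

```python
# Minimal sketch of error-minimizing unlearnable noise; model and bound are assumptions.
import torch
import torch.nn.functional as F

def error_minimizing_noise(model, images, labels, eps=8 / 255, steps=20, lr=0.1):
    """Optimize per-sample noise delta (||delta||_inf <= eps) that minimizes training loss."""
    delta = torch.zeros_like(images, requires_grad=True)
    opt = torch.optim.SGD([delta], lr=lr)
    for _ in range(steps):
        loss = F.cross_entropy(model((images + delta).clamp(0, 1)), labels)
        opt.zero_grad()
        loss.backward()              # push the loss toward zero: "nothing left to learn"
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)  # keep the noise invisible
    return delta.detach()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
images, labels = torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,))
unlearnable = (images + error_minimizing_noise(model, images, labels)).clamp(0, 1)
```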
Pathological examination is the gold standard for cancer diagnosis. In general, both HE staining and multiple IHC stainings are needed to ensure the accuracy of the examination, which makes it complicated, time-consuming, and expensive. Virtual staining of pathological images aims to use frontier artificial intelligence technology, especially style transfer, to generate specific IHC-stained images from HE-stained images. Due to the ultra-high resolution of whole slide images (WSIs), most current research divides WSIs into patches, processes them separately, and then obtains the target WSI through simple post-processing. This results in differences in color and brightness between adjacent patches, making the generated images less authentic. The research team is the first to define this problem as the square effect.
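The patch-based pipeline and the resulting square effect can be illustrated with the toy sketch below: a WSI is tiled, each tile is translated independently by a stain-transfer model, and naive stitching leaves visible color and brightness discontinuities at tile borders. Here `stain_transfer_model` is a hypothetical stand-in, not the paper's network.

```python
# Minimal sketch of patch-wise virtual staining and the resulting square effect.
import numpy as np

def virtual_stain_wsi(wsi, stain_transfer_model, patch=512):
    """wsi: (H, W, 3) uint8 HE image; returns a stitched virtual IHC image."""
    h, w, _ = wsi.shape
    out = np.zeros_like(wsi)
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            tile = wsi[y:y + patch, x:x + patch]
            out[y:y + patch, x:x + patch] = stain_transfer_model(tile)  # independent per tile
    return out  # naive stitching: adjacent tiles may disagree in color/brightness

# Toy stand-in "model" that shifts brightness randomly per tile, making the seams visible.
def stain_transfer_model(tile):
    return np.clip(tile.astype(np.int16) + np.random.randint(-30, 30), 0, 255).astype(np.uint8)

fake_wsi = np.random.randint(0, 255, size=(2048, 2048, 3), dtype=np.uint8)
virtual_ihc = virtual_stain_wsi(fake_wsi, stain_transfer_model)
```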