FVL Laboratory

About Us

The Fudan Vision and Learning (FVL) Laboratory was founded by Professor Yu-Gang Jiang to push the frontiers of computer vision and machine learning research and translate breakthroughs into real-world impact. Our mission is to equip machines with the ability to learn, perceive, and reason as humans do by extracting rich knowledge from massive amounts of data such as images, videos, and language.

Over the years, we have developed innovative computer vision and machine learning techniques for large-scale video recognition, video captioning and language-based retrieval, trustworthy machine learning, and more. Our research has been published in flagship conferences and journals, transferred to real-world applications such as defect detection for high-speed railway infrastructure and mobile video search, and adopted in winning solutions for international benchmark competitions.

Research

Video Understanding

Videos capture the appearance of objects and people and how they evolve over time. The rich interactions among objects pose great challenges for automated video content understanding. We exploit visual, audio, and textual information, mine the relationships among these modalities for representation learning, and build lightweight, efficient recognition models by pruning redundant information. The resulting models achieve top-notch performance on a variety of video understanding benchmarks. Furthermore, to spur video recognition research, we have released several large-scale video classification benchmarks that are widely used in both academia and industry.
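
To make the pruning idea concrete, the sketch below shows one generic recipe for slimming a video transformer: rank spatiotemporal tokens by their attention from the [CLS] token and keep only the top fraction. This is a minimal PyTorch illustration under our own assumptions (the function name, scoring rule, and keep ratio are illustrative), not the lab's published method.

    import torch

    def prune_tokens(tokens, attn, keep_ratio=0.5):
        """Keep the most informative spatiotemporal tokens.

        tokens: (B, N, D) patch tokens (excluding [CLS])
        attn:   (B, N) attention weights from [CLS] to each token
        """
        B, N, D = tokens.shape
        k = max(1, int(N * keep_ratio))
        idx = attn.topk(k, dim=1).indices  # (B, k) indices of top-k tokens
        # Later transformer blocks attend over k << N tokens,
        # cutting their cost roughly by a factor of N / k.
        return tokens.gather(1, idx.unsqueeze(-1).expand(B, k, D))

    # Toy usage: 2 clips, 196 tokens each, 768-dim features.
    print(prune_tokens(torch.randn(2, 196, 768), torch.rand(2, 196)).shape)
    # torch.Size([2, 98, 768])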

Vision and Language

Computer vision aims to derive meaningful information from visual inputs and, ultimately, to reason about the relations among objects and actions in images and videos as humans do. Natural language processing, in turn, enables computers to process and interpret human language. The intersection of vision and language, encompassing visual question answering, image and video captioning, and sentence localization, is a critical step towards building human-like intelligence. We study how to develop unified vision and language models that bridge the gap between the two modalities. The models we have developed have achieved top performance on a multitude of international benchmark leaderboards.
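
For concreteness, language-based retrieval of this kind is commonly framed as embedding both modalities into a shared space and ranking candidates by cosine similarity. The sketch below assumes generic encoder outputs and a placeholder temperature; it illustrates the common recipe, not the lab's specific models.

    import torch
    import torch.nn.functional as F

    def retrieval_scores(video_emb, text_emb, temperature=0.07):
        """Cosine similarity between every (video, text) pair.

        video_emb: (Nv, D) pooled features from a video encoder
        text_emb:  (Nt, D) sentence features from a text encoder
        Returns an (Nv, Nt) matrix; row i ranks all texts for video i.
        """
        v = F.normalize(video_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        return v @ t.T / temperature

    # Toy usage: 4 videos, 6 candidate sentences, 512-dim shared space.
    scores = retrieval_scores(torch.randn(4, 512), torch.randn(6, 512))
    print(scores.argmax(dim=1))  # top-ranked sentence for each video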

Trustworthy ML

As AI systems are increasingly deployed in real-world applications to make decisions that affect our daily lives, the trustworthiness of these systems has become more important than ever. We study the trustworthiness of the deep learning models used in modern AI systems, with the goal of building robust, secure, fair, and privacy-preserving machine learning models for future AI. Adversarial attack/defense, backdoor attack/defense, robust learning, fair learning, privacy leakage/defense, data protection, and AI intellectual property protection are among the topics we explore. The robust models, learning methods, loss functions, and attack and defense strategies developed by our team have made a substantial impact on the global community, through theoretical contributions, empirical results, or both.
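
As one canonical example from this area, the fast gradient sign method (FGSM) of Goodfellow et al. crafts an adversarial image by stepping along the sign of the loss gradient. The sketch below is the textbook formulation with a toy stand-in classifier, not an attack specific to our lab.

    import torch
    import torch.nn.functional as F

    def fgsm_attack(model, x, y, eps=8 / 255):
        """FGSM: perturb x within an L-infinity ball of radius eps.

        x: (B, C, H, W) clean images in [0, 1]; y: (B,) true labels.
        """
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        # Step in the direction that maximally increases the loss.
        return (x + eps * x.grad.sign()).clamp(0, 1).detach()

    # Toy usage with a tiny linear classifier (10 classes, 32x32 RGB).
    model = torch.nn.Sequential(torch.nn.Flatten(),
                                torch.nn.Linear(3 * 32 * 32, 10))
    x, y = torch.rand(2, 3, 32, 32), torch.tensor([3, 7])
    x_adv = fgsm_attack(model, x, y)
    print((x_adv - x).abs().max())  # bounded by eps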

Featured Publications

Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization

SimDA: Simple Diffusion Adapter for Efficient Video Generation

MotionEditor: Editing Video Motion via Content-Aware Diffusion