Computer vision aims to derive meaningful information from visual inputs and, eventually, to reason about the relations among objects and actions in images and videos as humans do. Natural language processing, in turn, enables computers to process and interpret human language. The intersection of vision and language, which encompasses visual question answering, image and video captioning, and sentence localization, is a critical step towards building human-like intelligence. We study how to develop unified vision-and-language models that bridge the gap between the two modalities. The models we have developed have achieved top performance on a multitude of international benchmark leaderboards.
Localizing events with natural language is a challenging task that requires fine-grained, language-guided understanding of the key elements (i.e., objects) of an event. We proposed a hierarchical visual-textual graph for feature extraction, cross-modal relation modeling, and spatial-temporal aggregation of these key elements, yielding a compact video representation for efficient event localization.
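To make the idea concrete, the sketch below shows one way language-guided cross-modal relation modeling and temporal aggregation can produce a compact video representation. It is a minimal illustration under assumed module names and feature dimensions, not the published hierarchical graph model.

```python
# Minimal sketch (illustrative assumptions throughout): video nodes attend to
# language tokens, then a language-conditioned pooling compresses the sequence
# into a single compact vector for event localization.
import torch
import torch.nn as nn

class CrossModalAggregator(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, hidden_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden_dim)   # project frame/object features
        self.text_proj = nn.Linear(text_dim, hidden_dim)     # project word features
        # cross-modal relation modeling: video queries attend to language tokens
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        # temporal aggregation: learned attention pooling over time
        self.pool_score = nn.Linear(hidden_dim, 1)

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, video_dim), text_feats: (B, L, text_dim)
        v = self.video_proj(video_feats)
        t = self.text_proj(text_feats)
        # each video node gathers relevant textual cues
        v_enhanced, _ = self.cross_attn(query=v, key=t, value=t)
        # aggregate over time with attention weights -> compact representation
        weights = torch.softmax(self.pool_score(v_enhanced), dim=1)   # (B, T, 1)
        return (weights * v_enhanced).sum(dim=1)                      # (B, hidden_dim)

# Usage with random tensors standing in for extracted video and sentence features.
video = torch.randn(2, 32, 2048)   # 2 clips, 32 frames each
query = torch.randn(2, 12, 768)    # 2 sentences, 12 tokens each
compact = CrossModalAggregator()(video, query)
print(compact.shape)               # torch.Size([2, 512])
```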
Existing Referring Image Segmentation (RIS) models focus on establishing fine-grained correspondences between objects in images and phrases in language. However, the visual cues of referred objects are sometimes insufficient due to occlusion, small size, illumination, etc. Hence, we developed a Two-stage Visual cues enhancement Network (TV-Net) to enhance the visual cues of referred objects.
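The sketch below illustrates the general two-stage idea of enhancing weak visual cues under language guidance: a coarse, language-guided localization followed by refinement of the highlighted region. The module names and the exact enhancement scheme are assumptions for illustration, not the published TV-Net architecture.

```python
# Hedged sketch of two-stage visual-cue enhancement for referring segmentation.
import torch
import torch.nn as nn

class TwoStageEnhancer(nn.Module):
    def __init__(self, vis_dim=256, lang_dim=256):
        super().__init__()
        # Stage 1: language-guided coarse localization of the referred object
        self.coarse_head = nn.Conv2d(vis_dim + lang_dim, 1, kernel_size=1)
        # Stage 2: re-weight and refine visual features inside the coarse region
        self.refine = nn.Sequential(
            nn.Conv2d(vis_dim, vis_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.seg_head = nn.Conv2d(vis_dim, 1, kernel_size=1)

    def forward(self, vis_feat, lang_feat):
        # vis_feat: (B, C, H, W) image features; lang_feat: (B, C) sentence embedding
        B, _, H, W = vis_feat.shape
        lang_map = lang_feat[:, :, None, None].expand(B, lang_feat.size(1), H, W)
        coarse = torch.sigmoid(self.coarse_head(torch.cat([vis_feat, lang_map], dim=1)))
        # amplify weak visual cues where the coarse mask points, then refine
        enhanced = self.refine(vis_feat * (1.0 + coarse))
        return self.seg_head(enhanced), coarse    # refined mask logits + coarse mask

vis = torch.randn(2, 256, 40, 40)
lang = torch.randn(2, 256)
mask_logits, coarse = TwoStageEnhancer()(vis, lang)
print(mask_logits.shape, coarse.shape)           # (2, 1, 40, 40) (2, 1, 40, 40)
```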
Recent video recognition frameworks offer excellent recognition results, yet their computational expense limits their impact in many real-world applications. We present AdaFrame, a framework that adaptively selects relevant frames on a per-input basis for fast video recognition.
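The following sketch conveys the flavor of adaptive, per-input frame selection: an agent inspects one frame at a time, decides where to look next, and exits early once confident. It is an illustrative simplification with assumed hyperparameters, not the AdaFrame training procedure.

```python
# Minimal sketch of adaptive frame selection with early exit (illustrative only).
import torch
import torch.nn as nn

class AdaptiveFrameSelector(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, num_classes=200):
        super().__init__()
        self.rnn = nn.LSTMCell(feat_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)
        self.where_next = nn.Linear(hidden_dim, 1)    # relative position in [0, 1]

    @torch.no_grad()
    def infer(self, frame_feats, max_steps=5, conf_thresh=0.9):
        # frame_feats: (T, feat_dim) precomputed per-frame features of one video
        T = frame_feats.size(0)
        h = torch.zeros(1, self.rnn.hidden_size)
        c = torch.zeros(1, self.rnn.hidden_size)
        idx = T // 2                                   # start from the middle frame
        logits = None
        for _ in range(max_steps):
            h, c = self.rnn(frame_feats[idx].unsqueeze(0), (h, c))
            logits = self.classifier(h)
            conf = torch.softmax(logits, dim=-1).max()
            if conf >= conf_thresh:                    # early exit saves computation
                break
            rel_pos = torch.sigmoid(self.where_next(h))    # where to look next
            idx = int(rel_pos.item() * (T - 1))
        return logits

feats = torch.randn(64, 2048)                          # 64 frames of CNN features
pred = AdaptiveFrameSelector().infer(feats)
print(pred.argmax(dim=-1))
```

The efficiency gain comes from examining only a handful of informative frames per video rather than densely processing every frame.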