Videos capture the appearance of objects and people and how those evolve over time. The rich interactions among objects pose significant challenges for automated video content understanding. We explore visual, audio, and textual information in videos and mine relationships across modalities for representation learning, and the resulting models achieve top-notch performance on a variety of video understanding benchmarks. In addition, video recognition models are computationally expensive, which prevents their deployment in real-world applications. This motivates us to build lightweight and efficient models that save computation without any degradation in performance. Furthermore, to inspire video recognition research, we have released several large-scale video classification benchmarks that are widely used in both academia and industry.
We study BERT-style pretraining of video transformers and present BEVT, which decouples video representation learning into spatial representation learning and temporal dynamics learning. BEVT first performs masked image modeling on image data, and then conducts masked image modeling jointly with masked video modeling on video data.
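The masked video modeling step predicts the content of hidden spatiotemporal patch tokens. A minimal sketch of one common masking strategy, "tube" masking, where the same spatial positions are hidden across all frames, is shown below; the exact masking scheme, grid size, and ratio here are illustrative assumptions, not necessarily BEVT's configuration:

```python
import numpy as np

def tube_mask(t, h, w, mask_ratio=0.4, rng=None):
    """Sample a tube mask over a T x H x W grid of video patch tokens.

    The same spatial positions are masked in every frame, so the model
    cannot recover a masked patch by copying it from a neighboring frame
    and must reason about appearance and motion jointly.
    """
    rng = rng or np.random.default_rng(0)
    n_spatial = h * w
    n_masked = int(round(mask_ratio * n_spatial))
    # Pick which spatial positions to hide (without replacement).
    hidden = rng.choice(n_spatial, size=n_masked, replace=False)
    spatial = np.zeros(n_spatial, dtype=bool)
    spatial[hidden] = True
    # Broadcast the 2-D spatial mask along time -> (T, H, W) boolean mask.
    return np.broadcast_to(spatial.reshape(h, w), (t, h, w))

# 8 frames, 14x14 patch grid (typical for 224x224 inputs with 16x16 patches).
mask = tube_mask(8, 14, 14, mask_ratio=0.4)
```

During pretraining, the tokens at `True` positions are replaced by a learnable mask token and the model is trained to reconstruct their targets.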
Label distributions in the real world are oftentimes long-tailed and imbalanced, resulting in models biased toward dominant labels. While long-tailed recognition has been extensively studied for image classification, limited effort has been made in the video domain. We introduce VideoLT, a large-scale long-tailed video recognition dataset, as a step toward real-world video recognition.
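To make the bias concrete: under a long-tailed label distribution, a loss averaged uniformly over samples is dominated by head classes. One standard remedy (an illustrative baseline, not a method proposed with VideoLT) is class-balanced re-weighting via the effective number of samples, where each class receives weight proportional to (1 - beta) / (1 - beta^n_c):

```python
from collections import Counter

def class_balanced_weights(labels, beta=0.999):
    """Per-sample loss weights from the effective-number-of-samples rule.

    Rare classes get weights close to 1 / n_c behaviour, frequent classes
    are down-weighted; the result is normalized so weights average to 1.
    """
    counts = Counter(labels)
    per_class = {c: (1 - beta) / (1 - beta ** n) for c, n in counts.items()}
    mean_w = sum(per_class[l] for l in labels) / len(labels)
    return [per_class[l] / mean_w for l in labels]

# Toy long-tailed dataset: 900 head, 90 mid, 10 tail samples.
labels = ["head"] * 900 + ["mid"] * 90 + ["tail"] * 10
weights = class_balanced_weights(labels)
```

Multiplying each sample's loss by its weight shifts training signal toward the tail without changing the sampler.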
Recent video recognition frameworks achieve excellent accuracy, yet their computational expense limits their use in many real-world applications. We present LiteEval, a framework that adaptively selects the optimal resolution on a per-input basis for fast video recognition.
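The intuition behind per-input resolution selection can be sketched as a coarse-to-fine policy: run a cheap low-resolution model first and pay for the expensive full-resolution model only when the cheap prediction is not confident. LiteEval learns its gating end to end; the fixed confidence threshold below is a hypothetical stand-in for illustration only:

```python
def adaptive_eval(clip, coarse_model, fine_model, threshold=0.9):
    """Return (class probabilities, which path was used).

    If the low-resolution pass is already confident, skip the costly
    full-resolution pass entirely; otherwise fall back to it.
    """
    probs = coarse_model(clip)          # cheap low-resolution pass
    if max(probs) >= threshold:
        return probs, "low-res"
    return fine_model(clip), "full-res"

# Toy stand-ins: an "easy" clip the coarse model handles confidently,
# and a "hard" clip that gets routed to the full-resolution model.
coarse = lambda clip: [0.95, 0.05] if clip == "easy" else [0.6, 0.4]
fine = lambda clip: [0.1, 0.9]

easy_probs, easy_path = adaptive_eval("easy", coarse, fine)
hard_probs, hard_path = adaptive_eval("hard", coarse, fine)
```

Averaged over a dataset where most inputs are easy, most clips take only the cheap path, which is where the computational savings come from.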