Boundary-denoising for video activity localization

Video activity localization aims at understanding the semantic content in long, untrimmed videos and retrieving actions of interest. The retrieved action with its start and end locations can be used for highlight generation, temporal action …

ETAD: Training Action Detection End to End on a Laptop

FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing

GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation

In this study, we explore Transformer-based diffusion models for image and video generation. Despite the dominance of Transformer architectures in various fields due to their flexibility and scalability, the visual generative domain primarily …

Mindstorms in Natural Language-Based Societies of Mind

Query Localization in Long-form Videos

Where is my wallet? Modeling object proposal sets for egocentric visual query localization

Contrastive language-action pre-training for temporal localization

Ego4D: Around the world in 3,000 hours of egocentric video

LC-NAS: Latency constrained neural architecture search for point cloud networks