Frost is an AI Research Scientist at Meta, based in London. He is dedicated to advancing the field of video generation and understanding, with the ultimate goal of giving people the power to build community and bring the world closer together.
In 2023, Frost recieved his PhD degree at King Abdullah University of Science and Technology (KAUST) under the supervision of Professor Bernard Ghanem, where he worked on Query Localization in Long-form Videos. He enjoys learning to play Guqin in his free time.
Please email xu.frost[at]gmail.com for the up-to-date CV.
MS/PhD in Electrical and Computer Engineering, 2017 - 2023
King Abdullah University of Science and Technology
BSc in Opto-Elctronics Information Science and Engineering, 2013 - 2017
Zhejiang University
2013 - Present
2017 - Present
2020 - Present
Video activity localization aims at understanding the semantic content in long, untrimmed videos and retrieving actions of interest. The retrieved action with its start and end locations can be used for highlight generation, temporal action detection, etc. Unfortunately, learning the exact boundary location of activities is highly challenging because temporal activities are continuous in time, and there are often no clear-cut transitions between actions. Moreover, the definition of the start and end of events is subjective, which may confuse the model. To alleviate the boundary ambiguity, we propose to study the video activity localization problem from a denoising perspective. Specifically, we propose an encoder-decoder model named DenosieLoc. During training, a set of action spans is randomly generated from the ground truth with a controlled noise scale. Then, we attempt to reverse this process by boundary denoising, allowing the localizer to predict activities with precise boundaries and resulting in faster convergence speed. Experiments show that DenosieLoc advances several video activity understanding tasks. For example, we observe a gain of +12.36% average mAP on the QV-Highlights dataset. Moreover, DenosieLoc achieves state-of-the-art performance on the MAD dataset but with much fewer predictions than others.
Temporal action detection is a fundamental yet challenging task in video understanding. Video context is a critical cue to effectively detect actions, but current works mainly focus on temporal context, while neglecting semantic context as well as other important context properties. In this work, we propose a graph convolutional network (GCN) model to adaptively incorporate multi-level semantic context into video features and cast temporal action detection as a sub-graph localization problem. Specifically, we formulate video snippets as graph nodes, snippet-snippet correlations as edges, and actions associated with context as target sub-graphs. With graph convolution as the basic operation, we design a GCN block called GCNeXt, which learns the features of each node by aggregating its context and dynamically updates the edges in the graph. To localize each sub-graph, we also design a SGAlign layer to embed each sub-graph into the Euclidean space. Extensive experiments show that G-TAD is capable of finding effective video context without extra supervision and achieves state-of-the-art performance on two detection benchmarks. On ActityNet-1.3, we obtain an average mAP of 34.09%; on THUMOS14, we obtain 40.16% in mAP@0.5, beating all the other one-stage methods.