Dr Mengmeng Xu (Frost)

AI Research Scientist

KAUST

Biography

Frost is a dedicated engineer and scientist in the field of image and video understanding, with the ultimate goal of giving people the AI power to build community and bring the world closer together. In 2023, Frost received his PhD degree at King Abdullah University of Science and Technology (KAUST) under the supervision of Professor Bernard Ghanem, where he worked on query localization in long-form videos. Prior to his graduate studies, Frost earned a bachelor’s degree from Zhejiang University, College of Optical Science and Engineering. Frost also enjoys learning to play Guqin in his free time.

Please email xu.frost[at]gmail.com for the up-to-date CV.

Interests

Video Understanding
Video Generation

Education

MS/PhD in Electrical and Computer Engineering, 2017 - 2023
King Abdullah University of Science and Technology
BSc in Opto-Elctronics Information Science and Engineering, 2013 - 2017
Zhejiang University

Skills

Engineering

2013 - Present

Deep Learning

2017 - Present

Generative Model

2020 - Present

Experience

AI Research Scientest

Meta Platforms, Inc.

Jan 2024 – Present London, UK

Research Scientest Intern

Meta Platforms, Inc.

Mar 2022 – Oct 2022 London, UK

Applied Scientest Intern

Amazon.com, Inc

Sep 2021 – Jan 2022 Berlin, Germany

Research Intern

Samsung Electronics (UK) Ltd

Oct 2020 – Mar 2021 Cambridge, UK

Featured Publications

Mengmeng Xu, Mattia Soldan, Jialin Gao, Shuming Liu, Juan-Manuel Perez-Rua, Bernard Ghanem

January 2023 ICLR 2024

Boundary-denoising for video activity localization

Video activity localization aims at understanding the semantic content in long, untrimmed videos and retrieving actions of interest. The retrieved action with its start and end locations can be used for highlight generation, temporal action detection, etc. Unfortunately, learning the exact boundary location of activities is highly challenging because temporal activities are continuous in time, and there are often no clear-cut transitions between actions. Moreover, the definition of the start and end of events is subjective, which may confuse the model. To alleviate the boundary ambiguity, we propose to study the video activity localization problem from a denoising perspective. Specifically, we propose an encoder-decoder model named DenosieLoc. During training, a set of action spans is randomly generated from the ground truth with a controlled noise scale. Then, we attempt to reverse this process by boundary denoising, allowing the localizer to predict activities with precise boundaries and resulting in faster convergence speed. Experiments show that DenosieLoc advances several video activity understanding tasks. For example, we observe a gain of +12.36% average mAP on the QV-Highlights dataset. Moreover, DenosieLoc achieves state-of-the-art performance on the MAD dataset but with much fewer predictions than others.

Shoufa Chen, Mengmeng Xu, Jiawei Ren, Yuren Cong, Sen He, Yanping Xie, Animesh Sinha, Ping Luo, Tao Xiang, Juan-Manuel Perez-Rua

January 2023 CVPR 2024

GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation

In this study, we explore Transformer-based diffusion models for image and video generation. Despite the dominance of Transformer architectures in various fields due to their flexibility and scalability, the visual generative domain primarily utilizes CNN-based U-Net architectures, particularly in diffusion-based models. We introduce GenTron, a family of Generative models employing Transformer-based diffusion, to address this gap. Our initial step was to adapt Diffusion Transformers (DiTs) from class to text conditioning, a process involving thorough empirical exploration of the conditioning mechanism. We then scale GenTron from approximately 900M to over 3B parameters, observing significant improvements in visual quality. Furthermore, we extend GenTron to text-to-video generation, incorporating novel motion-free guidance to enhance video quality. In human evaluations against SDXL, GenTron achieves a 51.1% win rate in visual quality (with a 19.8% draw rate), and a 42.3% win rate in text alignment (with a 42.9% draw rate). GenTron also excels in the T2I-CompBench, underscoring its strengths in compositional generation. We believe this work will provide meaningful insights and serve as a valuable reference for future research.

Mengmeng Xu, Chen Zhao, David S. Rojas, Ali Thabet, Bernard Ghanem

July 2013 CVPR 2020

G-TAD: Sub-Graph Localization for Temporal Action Detection

Temporal action detection is a fundamental yet challenging task in video understanding. Video context is a critical cue to effectively detect actions, but current works mainly focus on temporal context, while neglecting semantic context as well as other important context properties. In this work, we propose a graph convolutional network (GCN) model to adaptively incorporate multi-level semantic context into video features and cast temporal action detection as a sub-graph localization problem. Specifically, we formulate video snippets as graph nodes, snippet-snippet correlations as edges, and actions associated with context as target sub-graphs. With graph convolution as the basic operation, we design a GCN block called GCNeXt, which learns the features of each node by aggregating its context and dynamically updates the edges in the graph. To localize each sub-graph, we also design a SGAlign layer to embed each sub-graph into the Euclidean space. Extensive experiments show that G-TAD is capable of finding effective video context without extra supervision and achieves state-of-the-art performance on two detection benchmarks. On ActityNet-1.3, we obtain an average mAP of 34.09%; on THUMOS14, we obtain 40.16% in mAP@0.5, beating all the other one-stage methods.

Recent Publications

Quickly discover relevant content by filtering publications.

Shuming Liu, Mengmeng Xu, Chen Zhao, Xu Zhao, Bernard Ghanem (2023). ETAD: Training Action Detection End to End on a Laptop. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Cite

Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, Sen He (2023). FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing. ICLR 2024.

Cite

Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, others (2023). Mindstorms in Natural Language-Based Societies of Mind. arXiv preprint arXiv:2305.17066.

Cite

Mengmeng Xu (2023). Query Localization in Long-form Videos.

Cite

Mengmeng Xu, Yanghao Li, Cheng-Yang Fu, Bernard Ghanem, Tao Xiang, Juan-Manuel Perez-Rua (2023). Where is my wallet? modeling object proposal sets for egocentric visual query localization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Cite

See all publications

Contact

xu[dot]frost[at]gmail.com
London, NW1