Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes


Zhejiang University

Abstract

3D scene understanding has gained significant attention due to its wide range of applications. However, existing methods for 3D scene understanding are limited to specific downstream tasks, which hinders their practicality in real-world applications. This paper presents Chat-3D, which combines the 3D visual perception ability of pre-trained 3D representations with the impressive reasoning and conversation capabilities of advanced LLMs to build the first universal dialogue system for 3D scenes. Specifically, we align 3D representations into the feature space of LLMs, thus enabling LLMs to perceive the 3D world. Given the scarcity of 3D scene-text data, we propose a three-stage training strategy to efficiently utilize the available data for better alignment. To enhance the reasoning ability and develop a user-friendly interaction scheme, we further construct a high-quality object-centric 3D instruction dataset and design an associated object-centric prompt. Our experiments show that Chat-3D achieves an impressive ability to comprehend diverse instructions for 3D scenes, engage in intricate spatial reasoning, and incorporate external knowledge into its responses. Chat-3D achieves a 75.6% relative score compared with GPT-4 on the constructed instruction dataset. Our contributions can be summarized into three parts:

  1. Chat-3D Architecture. We build Chat-3D, the first universal dialogue system for 3D scenes, which leverages the advanced visual perception capabilities of pre-trained 3D models in conjunction with the powerful reasoning and open-domain conversational abilities of LLMs.
  2. Data-efficient Three-stage Training Scheme. We introduce a new three-stage training scheme for multi-modal LLMs, enabling the model to progressively transition from learning individual object attributes to capturing complex spatial relations among objects. This approach effectively improves dialogue quality with the limited available data.
  3. Object-centric Instruct Data and Prompt. We construct a high-quality object-centric 3D instruction dataset that includes diverse dialogues about object attributes, positions, relationships, functionalities, placement suggestions, and detailed descriptions within 3D scenes. We propose a corresponding object-centric prompt approach to provide a user-friendly interaction method (a hypothetical sketch of such a prompt follows this list).
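
To make the object-centric prompt idea concrete, below is a minimal Python sketch of how such a prompt could be assembled. The template, the <obj> placeholder token, and the helper function are illustrative assumptions, not the paper's exact format.

    # Hypothetical object-centric prompt assembly. The template and the <obj>
    # placeholder are illustrative assumptions, not the paper's actual format.
    OBJ_TOKEN = "<obj>"  # placeholder later replaced by the target object's embedding

    def build_prompt(question: str) -> str:
        """Wrap a user question so the LLM knows which object in the
        3D scene the dialogue centers on."""
        system = ("You are given a 3D scene. The token "
                  f"{OBJ_TOKEN} marks the object selected by the user.")
        return f"{system}\nUser: {question} {OBJ_TOKEN}\nAssistant:"

    print(build_prompt("What is this object used for, and where should it be placed?"))

The key design point is that the user indicates a target object directly rather than describing it in words, which sidesteps ambiguous referring expressions in cluttered scenes.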

Chat-3D Architecture

Data-efficient Three-stage Training Scheme

Chat-3D employs a data-efficient three-stage training process to mitigate the scarcity of 3D-language data.

  • Stage 1: 3D Object Alignment. We train linear projection layers to align 3D object features to the word embedding space of the LLM by maximizing the similarity between embeddings (a minimal sketch of Stages 1 and 2 follows this list).
  • Stage 2: 3D Scene Alignment. We further train a relational module, supervised by a captioning objective, that captures complex relationships among 3D objects to represent the semantics of the entire 3D scene.
  • Stage 3: 3D Object-centric Instruct Tuning. We curate a high-quality 3D object-centric instruction dataset for fine-tuning our model.
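
The sketch below illustrates Stages 1 and 2 in PyTorch. The module names, feature dimensions, and the plain transformer encoder used as the relational module are assumptions for illustration; only the overall recipe (a linear projection trained to maximize embedding similarity, followed by a relation-aware module trained for captioning) comes from the stages above.

    # Minimal PyTorch sketch of Stages 1 and 2. Dimensions and module choices
    # are illustrative assumptions, not the paper's exact implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ObjectProjector(nn.Module):
        """Stage 1: project frozen 3D object features into the LLM's
        word embedding space via a learned linear layer."""
        def __init__(self, obj_dim=768, llm_dim=4096):
            super().__init__()
            self.proj = nn.Linear(obj_dim, llm_dim)

        def forward(self, obj_feats):
            # obj_feats: (num_objects, obj_dim) from a frozen 3D encoder
            return self.proj(obj_feats)

    def alignment_loss(projected, text_embeds):
        """Stage 1 objective: maximize cosine similarity between each projected
        object feature and the LLM embedding of its attribute text."""
        projected = F.normalize(projected, dim=-1)
        text_embeds = F.normalize(text_embeds, dim=-1)
        return 1.0 - (projected * text_embeds).sum(dim=-1).mean()

    class RelationModule(nn.Module):
        """Stage 2: capture relationships among projected object tokens to form
        a scene-level representation (a plain transformer encoder stand-in)."""
        def __init__(self, llm_dim=4096, num_layers=2, num_heads=8):
            super().__init__()
            layer = nn.TransformerEncoderLayer(
                d_model=llm_dim, nhead=num_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

        def forward(self, obj_tokens):
            # obj_tokens: (batch, num_objects, llm_dim)
            return self.encoder(obj_tokens)

In Stage 2, the scene tokens produced by the relational module would be fed to the LLM and trained with a standard caption-generation loss; Stage 3 then fine-tunes on the object-centric instruction data.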

Examples of Universal Dialogue of 3D Scenes

We provide visualization examples of conversations about 3D scenes with Chat-3D. These cases highlight the powerful perceptual, reasoning, and conversational capabilities of Chat-3D for 3D scenes.


Comparisons between Chat-3D and 2D Multi-modal LLMs

Comparisons between Chat-3D and 2D multi-modal LLM methods (such as MiniGPT-4, LLaVA, and mPLUG-Owl) demonstrate the advantages and the necessity of developing a dedicated multi-modal LLM for 3D scenes.

BibTeX


  @misc{wang2023chat3d,
      title={Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes},
      author={Zehan Wang and Haifeng Huang and Yang Zhao and Ziang Zhang and Zhou Zhao},
      year={2023},
      eprint={2308.08769},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
  }