MD-ProjTex: Texturing 3D Shapes with Multi-Diffusion Projection

1Bilkent University, 2Adobe Research

Teaser: textured meshes generated from the prompts "A yellow forklift with light dirt, 4k, realistic", "A race car", "A brown colored puppy with patterns on it, 4k, realistic", "A tiger", "A publicity photo, full color", and "A wolf".
Our framework textures 3D models from text prompts. It is training-free and fast, as multiple views are generated in parallel with a multi-diffusion approach applied to the projected textures.

Abstract

We introduce MD-ProjTex, a method for fast and consistent text-guided texture generation for 3D shapes using pretrained text-to-image diffusion models. At the core of our approach is a multi-view consistency mechanism in UV space, which ensures coherent textures across different viewpoints. Specifically, MD-ProjTex fuses noise predictions from multiple views at each diffusion step and jointly updates the per-view denoising directions to maintain 3D consistency.
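
In code, this fusion step can be sketched as follows. This is a minimal illustration, not the released implementation: project_to_uv, render_from_uv, and the per-view weights are hypothetical stand-ins for the projection operators described in the Method section. Each per-view prediction is splatted onto the shared UV texture, averaged per texel, and rendered back to every view.

```python
import torch

def fuse_views_in_uv(view_preds, project_to_uv, render_from_uv, view_weights):
    # view_preds:     list of per-view predictions (e.g., decoded x_0 images)
    # project_to_uv:  hypothetical operator mapping view i's screen-space image
    #                 onto the UV texture, returning (uv_colors, uv_visibility_mask)
    # render_from_uv: hypothetical operator rendering the UV texture back into view i
    # view_weights:   hypothetical per-view confidence weights (e.g., viewing angle)
    uv_sum, w_sum = 0.0, 0.0
    for i, pred in enumerate(view_preds):
        uv_colors, uv_mask = project_to_uv(pred, i)   # screen space -> UV texels
        w = view_weights[i] * uv_mask
        uv_sum = uv_sum + uv_colors * w
        w_sum = w_sum + w
    fused_uv = uv_sum / torch.clamp(w_sum, min=1e-8)  # weighted per-texel average
    # Render the single fused texture back into every view so the next denoising
    # step starts from mutually consistent inputs.
    return [render_from_uv(fused_uv, i) for i in range(len(view_preds))], fused_uv
```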

In contrast to existing state-of-the-art methods that rely on optimization or sequential view synthesis, MD-ProjTex is computationally more efficient and achieves better quantitative and qualitative results.

Method

MD-ProjTex Architecture

We use a latent diffusion model conditioned on depth and lineart for the denoising steps, while the multi-diffusion step operates on the projected textures. To facilitate this process, the pipeline combines encoder (E), denoising (SD), decoder (D), and projection steps. For simplicity, only three views are shown in the multi-view input visualization. Note that z_t, z_{t-1}, and z_0 are latent-space features of shape 4×64×64 and are not directly visualizable.

For clarity, the figure shows downsampled decoded images in their place. In contrast, x_0 lives in image space and can be visualized directly. We initialize z_T from a standard normal distribution; the framework then iterates denoising, decoding, projection, and encoding at each subsequent step.
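
A minimal sketch of this loop, under stated assumptions: denoise_step (one step of the depth/lineart-conditioned latent model, SD), vae_decode (D), vae_encode (E), scheduler_step, and the per-view conditioning maps cond are all hypothetical stand-ins, and fuse_views_in_uv is the sketch given after the abstract.

```python
import torch

def md_projtex_loop(denoise_step, vae_decode, vae_encode, scheduler_step,
                    project_to_uv, render_from_uv, cond, view_weights,
                    num_views=6, num_steps=50):
    z = torch.randn(num_views, 4, 64, 64)  # z_T ~ N(0, I): one 4x64x64 latent per view
    for t in reversed(range(num_steps)):
        # 1) Denoise every view in parallel with the conditioned model (SD)
        z0 = torch.stack([denoise_step(z[i], t, cond[i]) for i in range(num_views)])
        # 2) Decode predicted clean latents to image space (D); x_0 is visualizable
        x0 = vae_decode(z0)
        # 3) Multi-diffusion step: project all views onto the UV texture, fuse,
        #    and render the fused texture back into each view
        x0_views, _ = fuse_views_in_uv(list(x0), project_to_uv,
                                       render_from_uv, view_weights)
        # 4) Re-encode the consistent views (E) and step from z_t toward z_{t-1}
        z0_consistent = vae_encode(torch.stack(x0_views))
        z = scheduler_step(z, z0_consistent, t)
    return z
```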

Results

SOTA Comparison

Qualitative comparison of our method and competing methods on the Objaverse benchmark dataset.

Methods compared (left to right): CLIP-Mesh, TEXTure, Text2Tex, Paint3D, Ours.

Reference-based Inference

Qualitative results of our method conditioned on a style image.

Columns (left to right): Style Image, Bag, Birdhouse.