Lumina-DiMOO

An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

(♣ Equal Contributions, † Corresponding Authors)
1 Shanghai AI Laboratory   2 Shanghai Innovation Institute   3 Shanghai Jiao Tong University   4 The University of Sydney
5 Nanjing University   6 The Chinese University of Hong Kong   7 Tsinghua University

Abstract

We introduce Lumina-DiMOO, an open-source foundational model for seamless multimodal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by using fully discrete diffusion modeling to handle inputs and outputs across modalities. This approach gives Lumina-DiMOO higher sampling efficiency than previous autoregressive (AR) or hybrid AR-diffusion paradigms and lets it support a broad spectrum of multimodal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting), and image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multimodal models. To foster further advances in multimodal and discrete diffusion model research, we release our code and checkpoints.
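The sampling-efficiency argument can be illustrated with a generic masked discrete diffusion decoder, in which every step fills in many token positions in parallel instead of one token per step as in AR decoding. The sketch below is only an illustration under stated assumptions, not the released Lumina-DiMOO sampler: it assumes a masked (absorbing-state) formulation with confidence-based unmasking and a cosine schedule, and `MASK`, `toy_predict`, and `sample` are hypothetical names, with `toy_predict` standing in for a real denoising network.

```python
import math
import random

MASK = -1  # hypothetical sentinel id for a still-masked position


def toy_predict(tokens, vocab_size, rng):
    """Stand-in for a denoiser (assumption): propose a (token, confidence)
    pair for every position that is still masked."""
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def sample(seq_len, num_steps, vocab_size=16, seed=0):
    """Fill a fully masked sequence in num_steps parallel refinement steps."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(1, num_steps + 1):
        proposals = toy_predict(tokens, vocab_size, rng)
        # Cosine schedule: how many positions stay masked after this step.
        keep_masked = math.floor(seq_len * math.cos(math.pi / 2 * step / num_steps))
        if step == num_steps:
            keep_masked = 0  # the last step commits everything that is left
        num_to_unmask = max(len(proposals) - keep_masked, 0)
        # Commit the most confident proposals; the rest stay masked.
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:num_to_unmask]:
            tokens[i] = proposals[i][0]
    return tokens


if __name__ == "__main__":
    out = sample(seq_len=64, num_steps=8)
    print(len(out), "tokens decoded in 8 steps; AR decoding would need 64")
```

With these toy settings, 64 positions are decoded in 8 model calls, whereas token-by-token AR decoding would need 64 calls. The actual speedup of any real model depends on its schedule and per-step cost.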

Lumina-DiMOO Overall

Overview of Lumina-DiMOO’s Multifunctionality and Superior Performance.

Demo Examples

Text-to-Image Generation

Sample 1

A striking photograph of a glass of orange juice on a wooden kitchen table, capturing a playful moment. The orange juice splashes out of the glass and forms the word "Smile" in a whimsical, swirling script just above the glass. The background is softly blurred, revealing a cozy, homely kitchen with warm lighting and a sense of comfort.

Sample 2

Two ramekins filled with a creamy dessert, possibly a fruit custard or pudding, are presented. The dessert is garnished with diced mangoes, blueberries, and a drizzle of honey. The ramekins rest on a wooden board, surrounded by fresh mint leaves and a small bowl of honey. A bowl of fresh strawberries and a bunch of mint leaves are visible. The setting is completed with a white cloth and a wooden spoon.

Sample 3

A beautifully aged antique book is positioned carefully for a studio close-up, revealing a rich, dark brown leather cover. The words "Knowledge is Power" are prominently featured in the center with thick, flowing brushstrokes, gleaming in opulent gold paint. Tiny flecks of the gold leaf can be seen scattered around the ornately scripted letters, showcasing the craftsmanship that went into its creation.

Sample 4

Clean white brick wall, vibrant colorful spray-paint graffiti covering entire surface: top giant bubble letters "Lumina DiMOO", below stacked "MULTIMODAL DISCRETE", "DIFFUSION MODEL" in rainbow palette, fresh wet paint drips, daylight urban photography.

Sample 5

A collection of vibrant red roses is artfully arranged on a rustic wooden surface. The roses, in full bloom, display their intricate petal layers and deep red hue, while the wooden background, with its visible grain patterns and knots, adds a textured contrast. The roses are placed in a cluster, with some overlapping others, creating a sense of depth and dimension.

Sample 6

Close-up portrait of a young woman with light skin and long brown hair, looking directly at the camera. Her face is illuminated by dramatic, slatted sunlight casting shadows across her features, creating a pattern of light and shadow. Her eyes are a striking green, and her lips are slightly parted, with a natural pink hue. The background is a soft, dark gradient, enhancing the focus on her face. The lighting is warm and golden.

Sample 7

Illustrate a poignant moment from a slice-of-life anime, with two high school friends sharing a heartfelt conversation under a cherry blossom tree, petals gently falling around them.

Sample 8

Enchanted forest scene with large wooden letters "B", "E", and "A" adorned with greenery and colorful flowers, positioned in the center and right of the frame. The letters are embellished with red and white flowers, small white ornaments, and trailing vines. The forest floor is covered with brown pine needles and scattered leaves, with small glowing lanterns placed around the letters, casting warm light.

Sample 9

Design a poster for a fictitious film noir festival, featuring a shadowy detective silhouette against a backdrop of a rain-slicked 1940s city street, the title "Shadows and Fog: A Noir Retrospective" in stylish, vintage typography, all presented in a classic black and white sketch style.

Sample 10

A stunning photograph of a Scandinavian landscape, showcasing snow-capped mountains, dense evergreen forests, and a tranquil lake reflecting the clear blue sky. The scene is bathed in soft sunlight, creating a warm and inviting atmosphere. In the distance, a small, picturesque village can be seen nestled among the trees, with smoke gently rising from a few chimneys.

Sample 11

A captivating photograph of an exquisite wooden dragon sculpture, skillfully carved with intricate details and realistic scales. The dragon is poised on a tree branch, its grand wings spread wide, revealing a mesmerizing woodland landscape below. The sky is painted with a symphony of soft blues and yellows, as the sun casts its final rays beyond the horizon. The dragon's glass eyes lend it a lifelike presence.

Sample 12

A small hamster with a fluffy, light brown coat sits centrally on a red and orange striped sofa. The hamster's eyes are wide and alert, facing directly at the camera. The sofa's fabric features vertical stripes with a dark green and white border, creating a vibrant contrast with the hamster's soft fur. In the background, a dark green knitted blanket partially covers the sofa, adding texture to the scene.

Image Editing

Sample 1

Add a large bowl filled with tomato-based stew garnished with basil on a wooden table in the central area, occupying most of the middle and lower part of the image.

Sample 2

Add a tan and black bike frame with large, thick tires in the lower-left to upper-right center, spanning most of the image and occupying about three-quarters of the total area.

Sample 3

Remove all baked goods located in the upper central to upper right section, occupying a large horizontal area across the top of the image.

Sample 4

Stylize the image according to book Illustration with clear outlines and narrative focus.

Sample 5

Change the plain wall into brick wall.

Sample 6

Replace bird positioned in the upper central-right area of the image with a colorful butterfly.

Style Transfer

Sample 1
Sample 2
Sample 3
Sample 4

Subject-Driven Generation

Sample 1

A vibrant, intricately designed beaded hair accessory. At a bustling outdoor market at midday, it glimmers in the bright sunlight, resting delicately on a wooden table covered with colorful woven fabrics, while a gentle breeze rustles the nearby array of potted plants

Sample 2

A durable and elegant hardcover writing tool. Set against a sleek minimalist interior with its dark cover contrasting against white marble countertops, under the brilliant, diffused glow of overhead pendant lights that create an atmosphere of quiet elegance and focus.

Sample 3

A milkshake with whipped cream topping. During an evening music festival, it glows softly under twinkling fairy lights, with a blurred stage in the background showcasing musicians in action.

Controllable Generation

Sample 1

A charming porcelain teacup with floral patterns is perched elegantly on a wrought iron café table on a bustling city street, under a large umbrella casting gentle shade, while pedestrians stroll by in the background.

Sample 2

On a chic café table, a sleek modern tabletop lamp adds a modern touch to the morning bustle as sunlight filters through a nearby window, blending with the aroma of fresh coffee.

Sample 3

Captured in a bustling urban street at twilight, A creamy, rich-flavored dark beverage, is placed on an outdoor café table, as city lights begin to twinkle and passersby create a lively atmosphere.

Sample 4

In an urban park during a light drizzle, beneath dense tree cover that filters the overcast light, a man wearing a waterproof windbreaker is walking.

Inpainting and Extrapolation

Sample 1

Porsche showroom. Make there be a Porsche logo on the back wall behind the car.

Sample 2

A contemporary basement bar area, featuring sharp, bright colors, high-quality lighting, and a mood of modern relaxation. Include depth of field and a strong sense of atmosphere.

Sample 1

A breathtaking mountain range dramatically rising above a still alpine lake at dawn. The snow-capped peaks are bathed in the warm glow of the rising sun, displaying hues of vibrant orange, pink, and gold.

Sample 2

A serene, snow-capped mountain range reflected in a crystal-clear turquoise lake. The towering peaks are dusted with fresh snow, their slopes covered in vibrant green pine trees reaching towards the sky.

Image Understanding


Experimental Results

GenEval Benchmark

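| Methods | #Params | Single Object | Two Object | Counting | Colors | Position | Attribute | Overall ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gen. Only | | | | | | | | |
| SDXL | 2.6B | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 |
| Emu3-Gen | 8B | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 |
| SD3-Medium | 2B | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 |
| DALL-E 3 | - | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 |
| FLUX.1 [Dev] | 12B | 0.98 | 0.81 | 0.74 | 0.79 | 0.22 | 0.45 | 0.66 |
| OmniGen | 3.8B | 0.98 | 0.84 | 0.66 | 0.74 | 0.40 | 0.43 | 0.68 |
| Lumina-mGPT 2.0 | 7B | 0.99 | 0.87 | 0.44 | 0.85 | 0.44 | 0.54 | 0.69 |
| Unified | | | | | | | | |
| Show-o | 1.3B | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 | 0.53 |
| TokenFlow-XL | 14B | 0.95 | 0.60 | 0.41 | 0.81 | 0.16 | 0.24 | 0.55 |
| Janus-Pro | 7B | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 |
| GPT-4o | - | 0.99 | 0.92 | 0.85 | 0.92 | 0.75 | 0.61 | 0.84 |
| BAGEL | 14B | 0.99 | 0.94 | 0.81 | 0.88 | 0.64 | 0.63 | 0.82 |
| MMaDA | 8B | 0.99 | 0.76 | 0.61 | 0.84 | 0.20 | 0.37 | 0.63 |
| Lumina-DiMOO | 8B | 1.0 | 0.94 | 0.85 | 0.89 | 0.85 | 0.76 | 0.88 |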
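As a reading aid for the GenEval table, the Overall column appears to be the unweighted mean of the six category scores. That is an assumption based on the usual GenEval protocol, not something stated on this page. The short sketch below recomputes the mean for a few rows copied from the table; the names `ROWS` and `category_mean` are illustrative. Small differences from the reported values are expected because the published category scores are already rounded.

```python
# Category order: Single Object, Two Object, Counting, Colors, Position, Attribute.
# Each entry maps a method to (category scores, reported Overall), copied from the table.
ROWS = {
    "Janus-Pro": ([0.99, 0.89, 0.59, 0.90, 0.79, 0.66], 0.80),
    "GPT-4o": ([0.99, 0.92, 0.85, 0.92, 0.75, 0.61], 0.84),
    "MMaDA": ([0.99, 0.76, 0.61, 0.84, 0.20, 0.37], 0.63),
    "Lumina-DiMOO": ([1.0, 0.94, 0.85, 0.89, 0.85, 0.76], 0.88),
}


def category_mean(scores):
    """Unweighted mean over the six GenEval categories (assumed definition of Overall)."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for name, (scores, reported) in ROWS.items():
        print(f"{name}: mean={category_mean(scores):.3f} reported={reported:.2f}")
```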

DPG Benchmark

| Methods | #Params | Global | Entity | Attribute | Relation | Other | Overall ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gen. Only | | | | | | | |
| SDXL | 2.6B | 83.27 | 82.43 | 80.91 | 86.76 | 80.41 | 74.65 |
| Emu3-Gen | 8B | 85.21 | 86.68 | 86.84 | 90.22 | 83.15 | 80.60 |
| SD3-Medium | 2B | 87.90 | 91.01 | 88.83 | 80.70 | 88.68 | 84.08 |
| DALL-E 3 | - | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 | 83.50 |
| FLUX.1 [Dev] | 12B | 74.35 | 90.00 | 88.96 | 90.87 | 88.33 | 83.84 |
| OmniGen | 3.8B | 87.90 | 88.97 | 88.47 | 87.95 | 83.56 | 81.16 |
| Lumina-mGPT 2.0 | 7B | - | 88.94 | 88.08 | 91.70 | - | 84.30 |
| Unified | | | | | | | |
| Show-o | 1.3B | - | - | - | - | - | 67.48 |
| TokenFlow-XL | 14B | 78.72 | 79.22 | 81.29 | 85.22 | 71.20 | 73.38 |
| Janus-Pro | 7B | 86.90 | 88.90 | 89.40 | 89.32 | 89.48 | 84.19 |
| GPT-4o | - | 88.89 | 88.94 | 89.84 | 92.63 | 90.96 | 85.15 |
| BAGEL | 14B | 88.94 | 90.37 | 91.29 | 90.82 | 88.67 | 85.07 |
| MMaDA | 8B | 77.81 | 78.48 | 81.74 | 84.79 | 63.2 | 69.97 |
| Lumina-DiMOO | 8B | 81.46 | 92.08 | 88.98 | 94.31 | 82.0 | 86.04 |

Image Understanding Benchmark

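| Methods | #Params | POPE | MME-P | MMB | SEED | MMMU |
| --- | --- | --- | --- | --- | --- | --- |
| Under. Only | | | | | | |
| LLaVA | 7B | 76.3 | 809.6 | 38.7 | 33.5 | - |
| LLaVA-v1.5 | 7B | 85.9 | 1510.7 | 64.3 | 58.6 | 35.4 |
| InstructBLIP | 7B | - | - | 36.0 | 53.4 | - |
| Qwen-VL-Chat | 7B | - | 1487.5 | 60.6 | 58.2 | - |
| Emu3-Chat | 8B | 85.2 | 1244 | 58.5 | 68.2 | 31.6 |
| Unified | | | | | | |
| Show-o | 1.3B | 80.0 | 1097.2 | - | - | 26.7 |
| TokenFlow-XL | 13B | 86.8 | 1545.9 | 68.9 | 68.7 | 38.7 |
| Janus-Pro | 7B | 87.4 | 1567.1 | 79.2 | 72.1 | 41.0 |
| BAGEL | 14B | - | 1687 | 85.0 | - | 55.3 |
| MMaDA | 8B | 86.1 | 1410.7 | 68.5 | 64.2 | 30.2 |
| Lumina-DiMOO | 8B | 87.4 | 1534.2 | 84.5 | 83.1 | 58.6 |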

Acknowledgements

This work was supported by and implemented with MindSpeed MM, an open-source training framework for large-scale multimodal models, developed and maintained by Huawei's Computing Product Line. Optimized specifically for Huawei's Ascend AI chips, MindSpeed MM offers comprehensive support for distributed training and is tailored to a wide range of multimodal tasks.

Citation

@article{Lumina-DiMOO,
 title={Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding},
 author={Yi Xin and Qi Qin and Siqi Luo and Kaiwen Zhu and Juncheng Yan and Yan Tai and Jiayi Lei and Yuewen Cao and Yuandong Pu and Le Zhuo and Shenglong Ye and Ming Hu and Junjun He and Bo Zhang and Dengyang Jiang and Gen Luo and Chang Xu and Wenhai Wang and Hongsheng Li and Guangtao Zhai and Tianfan Xue and Xiaohong Liu and Bin Fu and Yu Qiao and Yihao Liu},
 year={2025}
}