We introduce Lumina-DiMOO, an open-source foundational model for seamless multimodal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by using fully discrete diffusion modeling to handle inputs and outputs across modalities. This approach gives Lumina-DiMOO higher sampling efficiency than previous autoregressive (AR) or hybrid AR-diffusion paradigms and lets it support a broad spectrum of multimodal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting), and image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multimodal models. To foster further research on multimodal and discrete diffusion models, we release our code and checkpoints.
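To make the efficiency argument concrete, the following is a minimal toy sketch of discrete (masked) diffusion sampling, the paradigm Lumina-DiMOO builds on: generation starts from a fully masked token sequence and unmasks many positions per step in parallel, rather than one token at a time as in AR decoding. The `predict_logits` stub, vocabulary size, sequence length, and cosine unmasking schedule are illustrative assumptions, not the model's actual components or settings.

```python
import math
import numpy as np

VOCAB, SEQ_LEN, MASK_ID, STEPS = 16, 32, 16, 8
rng = np.random.default_rng(0)

def predict_logits(tokens):
    """Stand-in for the denoising transformer: random logits per position."""
    return rng.normal(size=(len(tokens), VOCAB))

def sample_discrete_diffusion():
    tokens = np.full(SEQ_LEN, MASK_ID)              # start fully masked
    for step in range(STEPS):
        masked = np.flatnonzero(tokens == MASK_ID)
        logits = predict_logits(tokens)
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        preds = probs.argmax(-1)                    # greedy guess per position
        conf = probs.max(-1)
        # Cosine schedule: fraction of positions still masked after this step.
        keep_masked = math.cos(math.pi / 2 * (step + 1) / STEPS)
        n_unmask = max(len(masked) - int(keep_masked * SEQ_LEN), 1)
        # Reveal the most confident masked positions in parallel; AR decoding
        # would reveal exactly one token per forward pass instead.
        order = masked[np.argsort(-conf[masked])]
        tokens[order[:n_unmask]] = preds[order[:n_unmask]]
    return tokens

tokens = sample_discrete_diffusion()
print("decoded tokens:", tokens)
```

With `STEPS` far smaller than `SEQ_LEN`, the number of forward passes is fixed by the schedule rather than by sequence length, which is the source of the sampling-efficiency advantage over AR decoding.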
Overview of Lumina-DiMOO’s Multifunctionality and Superior Performance.
Text-to-image generation results on the GenEval benchmark.

Methods | #Params | Single Object | Two Objects | Counting | Colors | Position | Attribute | Overall ↑ |
---|---|---|---|---|---|---|---|---|
**Gen. Only** | | | | | | | | |
SDXL | 2.6B | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 |
Emu3-Gen | 8B | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 |
SD3-Medium | 2B | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 |
DALL-E 3 | - | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 |
FLUX.1 [Dev] | 12B | 0.98 | 0.81 | 0.74 | 0.79 | 0.22 | 0.45 | 0.66 |
OmniGen | 3.8B | 0.98 | 0.84 | 0.66 | 0.74 | 0.40 | 0.43 | 0.68 |
Lumina-mGPT 2.0 | 7B | 0.99 | 0.87 | 0.44 | 0.85 | 0.44 | 0.54 | 0.69 |
**Unified** | | | | | | | | |
Show-o | 1.3B | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 | 0.53 |
TokenFlow-XL | 14B | 0.95 | 0.60 | 0.41 | 0.81 | 0.16 | 0.24 | 0.55 |
Janus-Pro | 7B | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 |
GPT-4o | - | 0.99 | 0.92 | 0.85 | 0.92 | 0.75 | 0.61 | 0.84 |
BAGEL | 14B | 0.99 | 0.94 | 0.81 | 0.88 | 0.64 | 0.63 | 0.82 |
MMaDA | 8B | 0.99 | 0.76 | 0.61 | 0.84 | 0.20 | 0.37 | 0.63 |
Lumina-DiMOO | 8B | 1.00 | 0.94 | 0.85 | 0.89 | 0.85 | 0.76 | 0.88 |
Text-to-image generation results on the DPG-Bench benchmark.

Methods | #Params | Global | Entity | Attribute | Relation | Other | Overall ↑ |
---|---|---|---|---|---|---|---|
**Gen. Only** | | | | | | | |
SDXL | 2.6B | 83.27 | 82.43 | 80.91 | 86.76 | 80.41 | 74.65 |
Emu3-Gen | 8B | 85.21 | 86.68 | 86.84 | 90.22 | 83.15 | 80.60 |
SD3-Medium | 2B | 87.90 | 91.01 | 88.83 | 80.70 | 88.68 | 84.08 |
DALL-E 3 | - | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 | 83.50 |
FLUX.1 [Dev] | 12B | 74.35 | 90.00 | 88.96 | 90.87 | 88.33 | 83.84 |
OmniGen | 3.8B | 87.90 | 88.97 | 88.47 | 87.95 | 83.56 | 81.16 |
Lumina-mGPT 2.0 | 7B | - | 88.94 | 88.08 | 91.70 | - | 84.30 |
**Unified** | | | | | | | |
Show-o | 1.3B | - | - | - | - | - | 67.48 |
TokenFlow-XL | 14B | 78.72 | 79.22 | 81.29 | 85.22 | 71.20 | 73.38 |
Janus-Pro | 7B | 86.90 | 88.90 | 89.40 | 89.32 | 89.48 | 84.19 |
GPT-4o | - | 88.89 | 88.94 | 89.84 | 92.63 | 90.96 | 85.15 |
BAGEL | 14B | 88.94 | 90.37 | 91.29 | 90.82 | 88.67 | 85.07 |
MMaDA | 8B | 77.81 | 78.48 | 81.74 | 84.79 | 63.20 | 69.97 |
Lumina-DiMOO | 8B | 81.46 | 92.08 | 88.98 | 94.31 | 82.00 | 86.04 |
Image understanding results on multimodal benchmarks.

Methods | #Params | POPE | MME-P | MMB | SEED | MMMU |
---|---|---|---|---|---|---|
**Under. Only** | | | | | | |
LLaVA | 7B | 76.3 | 809.6 | 38.7 | 33.5 | - |
LLaVA-v1.5 | 7B | 85.9 | 1510.7 | 64.3 | 58.6 | 35.4 |
InstructBLIP | 7B | - | - | 36.0 | 53.4 | - |
Qwen-VL-Chat | 7B | - | 1487.5 | 60.6 | 58.2 | - |
Emu3-Chat | 8B | 85.2 | 1244 | 58.5 | 68.2 | 31.6 |
**Unified** | | | | | | |
Show-o | 1.3B | 80.0 | 1097.2 | - | - | 26.7 |
TokenFlow-XL | 13B | 86.8 | 1545.9 | 68.9 | 68.7 | 38.7 |
Janus-Pro | 7B | 87.4 | 1567.1 | 79.2 | 72.1 | 41.0 |
BAGEL | 14B | - | 1687 | 85.0 | - | 55.3 |
MMaDA | 8B | 86.1 | 1410.7 | 68.5 | 64.2 | 30.2 |
Lumina-DiMOO | 8B | 87.4 | 1534.2 | 84.5 | 83.1 | 58.6 |