SLD是一种自我纠正的LLM控制扩散框架,它通过结合大型语言模型的检测能力,使图像生成模型能够更加精准地根据文本描述生成图像。 它不仅能生成高质量的图像,还能对图像进行细节级别的编辑,比如改变图像中对象的数量、属性等。 其最大的特点是通用性,可以与任何图像生成器兼容,极大地拓展了使用范围。这项技术在提高文本到图像对齐的准确性方面展现了巨大的潜力,为未来的图像生成和编辑技术开辟了新的道路。 <img class="wp-image-3083 aligncenter" src="https://img.xiaohu.ai/2024/03/teaser-1024x781.png" alt="" width="780" height="595" /> <h3>主要功能</h3> <ol> <li data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e"> <p data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e"><strong data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e">文本到图像的精准对齐</strong>:通过LLM(大语言模型)集成的检测器增强生成模型,实现对文本描述与生成图像之间精确的对齐。能够根据人们提供的文字描述,生成与之高度匹配的图像。这意味着如果你描述了一个特定的场景或物体,SLD能够创建出紧密符合这些描述的图片。</p> </li> <li data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e"> <p data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e"><strong data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e">图像的生成与编辑</strong>:不仅能够从零开始根据文本描述创造出新的图像,还能对现有的图像进行详细的编辑。比如,可以根据用户的指令改变图片中的某些细节,如物体的数量、颜色或位置等。</p> </li> <li data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e"> <p data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e"><strong data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e">与任何图像生成器的兼容性</strong>:SLD设计之初就考虑到了与任何现有的图像生成器(例如DALL-E 3)的兼容性,无需进行额外的模型训练或添加新的数据集。这种通用性使得SLD可以轻松集成到不同的图像生成框架中,扩展其使用场景和功能。</p> </li> </ol> <h3>工作原理</h3> <img class=" wp-image-3082 aligncenter" src="https://img.xiaohu.ai/2024/03/main_figure-1024x674.jpg" alt="" width="810" height="533" /> <ol> <li data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e"> <p data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e"><strong data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e">理解文本描述</strong>:首先,SLD使用大型语言模型(LLM)来深入理解用户提供的文本描述。这个过程涉及到分析文本中的关键信息,如物体、动作和场景设置等。</p> </li> <li data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e"> <p data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e"><strong data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e">生成初步图像</strong>:基于理解的文本描述,SLD调用图像生成模型开始创建与文本对应的图像。这一步骤的目标是根据文本内容产生一个初步的、大致匹配描述的视觉表示。</p> </li> <li data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e"> <p data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e"><strong data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e">检测与自我纠正</strong>:生成的初步图像接着通过LLM集成的检测系统进行评估,这个系统会检查图像是否准确反映了文本描述的各个方面。如果发现图像与文本描述不完全对齐,SLD会自动进行调整。这个自我纠正过程可能包括改变图像中的对象位置、数量、颜色或其他属性,以更精确地匹配原始文本描述。</p> </li> <li data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e"> <p data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e"><strong data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e">迭代改进</strong>:SLD可能会重复进行检测与自我纠正的过程,每次都在试图更准确地将文本描述转换为图像。通过这种迭代改进,最终生成的图像能够非常精确地反映文本意图。</p> </li> <li data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e"> <p data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e"><strong data-immersive-translate-walked="b76d5401-65ef-428f-977d-45df9f40835e">细粒度编辑</strong>:除了生成图像,SLD还支持根据用户的进一步指示对图像进行细节编辑,如调整特定对象的特征或在图像中添加新元素。这种编辑过程同样利用了LLM的理解能力和图像生成模型的灵活性,允许用户以非常自然的方式指导图像的最终形态。</p> </li> </ol> <h3>案例展示:</h3> <ul> <li data-immersive-translate-paragraph="1"><strong><span class="notranslate immersive-translate-target-wrapper" lang="zh-CN" data-immersive-translate-translation-element-mark="1"><span class="notranslate immersive-translate-target-translation-theme-none immersive-translate-target-translation-block-wrapper-theme-none immersive-translate-target-translation-block-wrapper" data-immersive-translate-translation-element-mark="1"><span class="notranslate immersive-translate-target-inner immersive-translate-target-translation-theme-none-inner" data-immersive-translate-translation-element-mark="1">可视化:纠正各种生成模型</span></span></span></strong></li> </ul> <p data-immersive-translate-paragraph="1">该框架能够改进如SDXL、LMD+和DALL-E 3等模型的文本到图像对齐,</p> <p class="paragraph" data-immersive-translate-paragraph="1"><span class="notranslate immersive-translate-target-wrapper" lang="zh-CN" data-immersive-translate-translation-element-mark="1"><span class="notranslate immersive-translate-target-translation-theme-none immersive-translate-target-translation-block-wrapper-theme-none immersive-translate-target-translation-block-wrapper" data-immersive-translate-translation-element-mark="1"><span class="notranslate immersive-translate-target-inner immersive-translate-target-translation-theme-none-inner" data-immersive-translate-translation-element-mark="1">第一行显示了 SLD 将蓝色自行车放置在长凳和棕榈树之间的精确度,并准确计数了棕榈树和海鸥。</span></span></span></p> <p class="paragraph" data-immersive-translate-paragraph="1"><span class="notranslate immersive-translate-target-wrapper" lang="zh-CN" data-immersive-translate-translation-element-mark="1"><span class="notranslate immersive-translate-target-translation-theme-none immersive-translate-target-translation-block-wrapper-theme-none immersive-translate-target-translation-block-wrapper" data-immersive-translate-translation-element-mark="1"><span class="notranslate immersive-translate-target-inner immersive-translate-target-translation-theme-none-inner" data-immersive-translate-translation-element-mark="1">第二行展示了 SLD 在复杂场景中的功效,通过免训练的潜在操作防止对象碰撞。</span></span></span></p> <p data-immersive-translate-paragraph="1"><img class=" wp-image-3081 aligncenter" src="https://img.xiaohu.ai/2024/03/result1-1024x461.png" alt="" width="828" height="373" /></p> <ul> <li data-immersive-translate-paragraph="1"><strong><span class="notranslate immersive-translate-target-wrapper" lang="zh-CN" data-immersive-translate-translation-element-mark="1"><span class="notranslate immersive-translate-target-translation-theme-none immersive-translate-target-translation-block-wrapper-theme-none immersive-translate-target-translation-block-wrapper" data-immersive-translate-translation-element-mark="1"><span class="notranslate immersive-translate-target-inner immersive-translate-target-translation-theme-none-inner" data-immersive-translate-translation-element-mark="1">可视化:复杂的对象级编辑</span></span></span></strong></li> </ul> <strong><span class="notranslate immersive-translate-target-wrapper" lang="zh-CN" data-immersive-translate-translation-element-mark="1"><span class="notranslate immersive-translate-target-translation-theme-none immersive-translate-target-translation-block-wrapper-theme-none immersive-translate-target-translation-block-wrapper" data-immersive-translate-translation-element-mark="1"><span class="notranslate immersive-translate-target-inner immersive-translate-target-translation-theme-none-inner" data-immersive-translate-translation-element-mark="1"> </span></span></span></strong>SLD还能处理由自然、类人的指令引导的多样化图像编辑任务,其能力范围从调整对象数量到改变对象属性、位置和大小。 <p data-immersive-translate-paragraph="1"><img class=" wp-image-3080 aligncenter" src="https://img.xiaohu.ai/2024/03/result2-1024x496.png" alt="" width="850" height="412" /></p> <p data-immersive-translate-paragraph="1">项目及演示:<a href="https://self-correcting-llm-diffusion.github.io/" target="_blank" rel="noopener">https://self-correcting-llm-diffusion.github.io/</a></p> <p data-immersive-translate-paragraph="1"><span class="notranslate immersive-translate-target-wrapper" lang="zh-CN" translate="no" data-immersive-translate-loading-id="250">论文:<a href="https://arxiv.org/abs/2311.16090" target="_blank" rel="noopener">https://arxiv.org/abs/2311.16090</a></span></p> <p data-immersive-translate-paragraph="1"><span class="notranslate immersive-translate-target-wrapper" lang="zh-CN" translate="no" data-immersive-translate-loading-id="250">GitHub: <a href="https://github.com/tsunghan-wu/SLD" target="_blank" rel="noopener">https://github.com/tsunghan-wu/SLD</a></span></p>