AI21 has introduced Jamba, the world's first production-grade Mamba-based model. Jamba combines Mamba structured state space model (SSM) technology with elements of the traditional Transformer architecture, compensating for the inherent limitations of a pure SSM model.

<img class="aligncenter size-full wp-image-5325" src="https://img.xiaohu.ai/2024/03/65fc211b53db0e4d61ac6677_jamba-best.svg" alt="" width="670" height="389" />
<h3>Jamba's Model Architecture</h3>
<ol>
<li>
<p><strong>SSM-Transformer hybrid architecture</strong>: Jamba uses a novel hybrid architecture that interleaves SSM (structured state space model) layers with Transformer layers, drawing on the strengths of both. This makes Jamba more efficient on long sequences while retaining a strong grasp of complex patterns in the data.</p>
</li>
<li>
<p><strong>Mixture-of-Experts (MoE) layers</strong>: Jamba uses MoE layers to further improve performance and efficiency. An MoE layer dynamically routes each input to the most suitable "expert" sub-modules at inference time, increasing capacity while keeping compute per token low.</p>
</li>
<li>
<p><strong>Large context window</strong>: Thanks to its SSM component, Jamba can handle a very large context window of up to 256K tokens. It can therefore draw on far more context than conventional models when understanding and generating text, improving accuracy.</p>
</li>
</ol>
<img class="aligncenter size-full wp-image-5321" src="https://img.xiaohu.ai/2024/03/65fc2607612a6f271c8e402c_jamba-architecture.svg" alt="" width="1140" height="682" />
<strong>Key points:</strong>
<ul>
<li>52B total parameters, of which only 12B are active during generation.</li>
<li>16 experts, of which only 2 are active during generation.</li>
<li>A new architecture combining joint attention and Mamba.</li>
<li>Context length of up to 256K tokens.</li>
<li>Handles up to a 140K-token context on a single A100 80GB GPU.</li>
<li>Up to 3x the throughput of Mixtral 8x7B on long contexts.</li>
</ul>
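<p>To make the interleaving idea concrete, here is a minimal, self-contained sketch of a hybrid stack in which most layers use a linear-time SSM-style mixer, every fourth layer uses attention, and every other layer swaps its MLP for a top-2-of-16 MoE layer. The layer counts, dimensions, routing scheme, and the simplified <code>SimpleSSM</code> block are illustrative assumptions, not Jamba's published implementation.</p>
<pre><code class="language-python"># Minimal, illustrative sketch of an SSM-Transformer hybrid stack with MoE layers.
# Layer counts, dimensions, and block internals are simplified assumptions, not Jamba's code.
import torch
import torch.nn as nn

D = 512  # toy hidden size

class SimpleSSM(nn.Module):
    """Stand-in for a Mamba block: a causal, linear-time recurrence over the sequence."""
    def __init__(self, d):
        super().__init__()
        self.in_proj = nn.Linear(d, d)
        self.decay = nn.Parameter(torch.full((d,), 0.9))  # fixed-size state, no KV cache
        self.out_proj = nn.Linear(d, d)

    def forward(self, x):                        # x: (batch, seq, d)
        u = self.in_proj(x)
        h = torch.zeros(x.size(0), x.size(2), device=x.device)
        ys = []
        for t in range(x.size(1)):               # one state update per step: O(seq), not O(seq^2)
            h = self.decay * h + u[:, t]
            ys.append(h)
        return self.out_proj(torch.stack(ys, dim=1))

class MoEMLP(nn.Module):
    """Route each token to its top-2 of 16 experts (evaluated densely here, for clarity)."""
    def __init__(self, d, n_experts=16, top_k=2):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):
        weights, idx = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1).float()
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out

class HybridLayer(nn.Module):
    """Token mixer (attention or SSM) plus an MLP (dense or MoE), each with a residual."""
    def __init__(self, d, use_attention, use_moe):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.use_attention = use_attention
        self.mixer = (nn.MultiheadAttention(d, 8, batch_first=True)
                      if use_attention else SimpleSSM(d))
        self.mlp = MoEMLP(d) if use_moe else nn.Sequential(
            nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        h = self.norm1(x)
        if self.use_attention:
            causal = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
            h, _ = self.mixer(h, h, h, attn_mask=causal, need_weights=False)
        else:
            h = self.mixer(h)
        x = x + h
        return x + self.mlp(self.norm2(x))

# Interleave: mostly SSM layers, attention every 4th layer, MoE on every other layer.
stack = nn.ModuleList(
    HybridLayer(D, use_attention=(i % 4 == 3), use_moe=(i % 2 == 1)) for i in range(8))

x = torch.randn(1, 16, D)
for layer in stack:
    x = layer(x)
print(x.shape)  # torch.Size([1, 16, 512])
</code></pre>
<p>In a pattern like this, only the occasional attention layers pay quadratic cost in sequence length and keep a key/value cache, while the MoE layers add parameters without adding per-token compute, which mirrors the "52B total, 12B active" and "16 experts, 2 active" figures above.</p>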
<h3>Background</h3>
<p>Jamba represents a significant innovation in model design. "Mamba" here refers to a structured state space model (SSM), a class of models for capturing how data evolves over time, well suited to sequential data such as text or time series. <strong>A key advantage of SSMs is that they handle long sequences efficiently, but they can be weaker than other architectures at modeling complex patterns and dependencies.</strong></p>
<p>The Transformer architecture, by contrast, has been one of the most successful models in AI in recent years, particularly for natural language processing (NLP). It is very effective at processing and understanding language and at capturing long-range dependencies, but it runs into compute-efficiency and memory problems on long sequences.</p>
<p>Jamba combines Mamba's SSM technology with elements of the Transformer architecture, aiming to exploit the strengths of both while overcoming their respective limitations. <strong>Through this combination, Jamba can process long sequences efficiently (Mamba's strength) while retaining a strong grasp of complex language patterns and dependencies (the Transformer's strength).</strong> In tasks that require understanding large amounts of text and complex dependencies, it can therefore stay efficient without sacrificing quality or accuracy.</p>
<h3>Jamba's Features</h3>
<ol>
<li>
<p><strong>High throughput and efficiency</strong>: Jamba is the first production-grade Mamba-based model, built on the novel SSM-Transformer hybrid architecture, and it processes text markedly faster than comparable models, especially conventional Transformers. On long contexts it delivers up to 3x the throughput, which matters in applications that work through large volumes of text. <img class="aligncenter size-full wp-image-5323" src="https://img.xiaohu.ai/2024/03/66053deb9bfd03888dfe4be1_jamba-throughput-desktop.svg" alt="" width="1140" height="609" /></p>
</li>
<li>
<p><strong>Large-context capability</strong>: Jamba is one of the few models that can handle a context of up to 140K tokens on a single GPU, which makes it well suited to complex tasks that must take a great deal of prior information into account. Overall it can manage and exploit a context window of up to 256K tokens. This is hard to achieve with conventional Transformers, whose memory footprint grows quickly with context length (the key/value cache grows linearly and attention compute grows quadratically); see the rough estimate after this list. This gives Jamba a clear advantage in understanding and generating long texts. <img class="aligncenter size-full wp-image-5322" src="https://img.xiaohu.ai/2024/03/66053e49c39da81a09569f43_jamba-context-desktop.svg" alt="" width="1140" height="511" />Jamba is the only model in its class that fits a 140K context on a single GPU.</p>
</li>
<li>
<p><strong>Open and flexible deployment</strong>: Jamba's model weights are released openly under the Apache 2.0 license, allowing extensive customization and optimization. It can also be deployed as a microservice on the NVIDIA AI Enterprise platform, giving enterprise applications flexible deployment options.</p>
</li>
<li>
<p><strong>Model scale and efficiency</strong>: By combining Mixture-of-Experts (MoE) layers with structured state space (SSM) technology, Jamba keeps the number of parameters active at inference time low without sacrificing quality. It can therefore run on comparatively modest compute resources while matching or exceeding other models.</p>
</li>
</ol>
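<p>To give a feel for point 2 above: in a pure Transformer, every layer keeps a key/value cache that grows linearly with the context, so very long contexts quickly exhaust GPU memory, whereas a Mamba layer carries only a small fixed-size state. The estimator below uses hypothetical layer counts and head sizes (not Jamba's published configuration) purely to illustrate why replacing most attention layers with SSM layers shrinks long-context memory.</p>
<pre><code class="language-python"># Rough, illustrative KV-cache estimate (hypothetical dimensions, not Jamba's actual config).
def kv_cache_gib(context_len, n_attention_layers, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes of keys and values cached per attention layer per token, summed over the context."""
    per_token = 2 * n_attention_layers * n_kv_heads * head_dim * bytes_per_value
    return context_len * per_token / 2**30

CONTEXT = 140_000          # tokens kept in context
TOTAL_LAYERS = 32          # hypothetical depth

# Pure Transformer: every layer is an attention layer and contributes to the cache.
print(f"all-attention : {kv_cache_gib(CONTEXT, TOTAL_LAYERS):.1f} GiB")

# Hybrid: suppose only 1 layer in 8 uses attention; the Mamba layers keep a fixed state instead.
print(f"1-in-8 hybrid : {kv_cache_gib(CONTEXT, TOTAL_LAYERS // 8):.1f} GiB")
</code></pre>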
<p><strong>Jamba outperforms or matches other models in its size class</strong></p>
<p style="text-align: center;"><img class="aligncenter size-full wp-image-5324" src="https://img.xiaohu.ai/2024/03/66056df31e77bedbe421df13_jamba-benchmark-desktop.svg" alt="" width="1140" height="566" />Jamba scores highest on reasoning-related benchmarks.</p>
<p>Website: <a href="https://www.ai21.com/jamba" target="_blank" rel="noopener">https://www.ai21.com/jamba</a></p>
<p>Announcement: <a href="https://www.ai21.com/blog/announcing-jamba" target="_blank" rel="noopener">https://www.ai21.com/blog/announcing-jamba</a></p>
<p>Model: <a href="https://huggingface.co/ai21labs/Jamba-v0.1" target="_blank" rel="noopener">https://huggingface.co/ai21labs/Jamba-v0.1</a></p>
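<p>Since the weights are openly available on the Hugging Face page above, a quick way to try the base model is through the <code>transformers</code> library. The sketch below is minimal and makes several assumptions: a recent <code>transformers</code> release with Jamba support (otherwise <code>trust_remote_code=True</code> is needed), enough GPU memory for the 52B-parameter checkpoint, and generation settings chosen only as examples.</p>
<pre><code class="language-python"># Minimal sketch: load the open Jamba-v0.1 weights and generate a short completion.
# Assumes a recent `transformers` with Jamba support (otherwise pass trust_remote_code=True)
# and enough GPU memory for the checkpoint; dtype/device settings here are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # half-precision weights to reduce memory
    device_map="auto",            # spread layers across available GPUs
)

prompt = "Structured state space models are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
</code></pre>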