LayerNorm and RMS Norm in Transformer Models

By Adrian Tam on August 18, 2025 in Building Transformer Models

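Normalization layers are key components of transformer models that help stabilize training. Without normalization, models often fail to converge or perform poorly. This post explores LayerNorm, RMS Norm, and their variants, explaining how they work and how they are implemented in modern language models.

Let's get started.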

LayerNorm and RMS Norm in Transformer Models
Photo by Redd Francisco. Some rights reserved.

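Overview

This post is divided into five parts; they are:

Why Normalization Is Needed in Transformers
LayerNorm and Its Implementation
Adaptive LayerNorm
RMS Norm and Its Implementation
Using PyTorch's Built-in Normalization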

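Why Normalization Is Needed in Transformers

Normalization layers improve model quality in deep learning. Convolutional models typically apply batch normalization after convolution layers, while transformer models interleave normalization with the attention and feed-forward components.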

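Normalization matters for several reasons:

Internal covariate shift: As data flows through the network, the distribution of activations can change significantly between training steps, which makes training unstable and forces careful learning-rate tuning. Normalization realigns the activation distribution, so that an update to one layer does not drastically change the input the next layer sees.
Gradient problems: Deep networks suffer from vanishing gradients because saturating activation functions such as sigmoid and tanh change rapidly near zero but flatten out at extreme values, where the gradient is close to zero. Vanishing gradients stall training, so keeping activations near zero is essential.
Faster convergence: Normalization keeps gradients within a reasonable range, making gradient descent more effective and convergence faster. In addition, normalized values cluster around zero, which creates a smaller search space and speeds up the search for optimal parameters during training.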
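
To make the gradient argument concrete, here is a minimal sketch that compares the gradient of tanh on large, unnormalized pre-activations with the gradient on the same values after standardization. The scale factor of 10 and the choice of tanh are assumptions made for illustration, and the sketch measures only the activation's local gradient, not the gradient through a full network.

```python
import torch

torch.manual_seed(0)

# Pre-activations with a large spread, as might occur deep in an unnormalized network
# (the factor 10.0 is an arbitrary choice for this illustration)
x = (torch.randn(1000) * 10.0).requires_grad_()
torch.tanh(x).sum().backward()
grad_raw = x.grad.abs().mean().item()

# The same values after standardizing to zero mean and unit variance
x_data = x.detach()
z = ((x_data - x_data.mean()) / x_data.std()).requires_grad_()
torch.tanh(z).sum().backward()
grad_norm = z.grad.abs().mean().item()

print(f"mean |d tanh/dx| without normalization: {grad_raw:.4f}")
print(f"mean |d tanh/dx| with normalization:    {grad_norm:.4f}")
```

In this setup the standardized inputs should give a noticeably larger average gradient, because far fewer of them land in the flat tails of tanh.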

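Transformer models typically have many layers. For example, the Llama 3 8B model has 32 decoder blocks, each containing an attention layer and a feed-forward sublayer built from three linear projections. With this depth, good gradient flow is essential, and it is achieved by placing normalization layers strategically.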
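
One way to picture this placement is a pre-norm block, where each sublayer receives normalized input and its output is added back to the residual stream. The sketch below is an illustration under stated assumptions, not Llama's actual code: the class name `PreNormBlock`, the use of `nn.LayerNorm` and `nn.MultiheadAttention`, the GELU feed-forward with a 4x expansion, and the missing causal mask are all simplifications.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Hypothetical pre-norm block: normalize, apply the sublayer, add the residual."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        # Attention sublayer sees normalized input; the residual path stays untouched
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Feed-forward sublayer follows the same pattern
        x = x + self.ffn(self.norm2(x))
        return x

x = torch.randn(2, 5, 64)
block = PreNormBlock(dim=64, num_heads=4)
out = block(x)
print(out.shape)  # torch.Size([2, 5, 64])
```

Because the residual path is never normalized, gradients can flow from the last block to the first through plain additions, which is one reason this layout is common in deep stacks.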

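LayerNorm and RMS Norm are the two most common normalization techniques in modern transformers. They differ in how they compute the normalization statistics. The following sections describe them in detail.

LayerNorm and Its Implementation

Layer norm, like batch norm, instance norm, and group norm, shifts and scales the input tensor: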

$$
y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}
$$

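The small quantity $\epsilon$ prevents division by zero. The mean $\mu$ and variance $\sigma^2$ are computed from the input data along the feature dimension. Here is an implementation: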

import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, eps=1e-5):
        super().__init__()
        self.eps = eps
    
    def forward(self, x):
        # Calculate mean and variance across the last dimension(s)
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        
        # Normalize
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return x_norm

# Example usage
batch_size, seq_len, hidden_dim = 2, 5, 128
x = torch.randn(batch_size, seq_len, hidden_dim)
layer_norm = LayerNorm()
output = layer_norm(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Output mean:\n{output.mean(axis=2)}")
print(f"Output std:\n{output.std(axis=2, correction=0)}")

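LayerNorm computes the variance without bias correction: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$. You could use the unbiased estimate, but the biased one is the conventional implementation. The simple implementation above has no learnable parameters: it only shifts and scales the input tensor. Running this code produces output with a mean close to zero and a variance of 1, which shows that the normalization is correct.

When you run this code, you may get output like the following: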

Input shape: torch.Size([2, 5, 128])
Output shape: torch.Size([2, 5, 128])
Output mean:
tensor([[-1.8626e-09,  2.4214e-08, -3.7253e-09, -9.3132e-09,  1.4901e-08],
        [-1.4901e-08, -1.2107e-08,  1.4901e-08, -7.4506e-09,  2.2352e-08]])
Output std:
tensor([[1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000]])

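The output tensor keeps the relative pattern of values within each vector (only the vector's mean and scale are discarded) while placing the values in a range better suited to neural network operations. LayerNorm is applied independently to each element of the sequence, normalizing across the entire feature vector.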
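
The per-position independence can be checked directly. The sketch below, which uses a hypothetical helper function `layer_norm` equivalent to the parameter-free module above, normalizes one token on its own and also perturbs a different token to confirm that the first is unaffected.

```python
import torch

def layer_norm(x, eps=1e-5):
    # Parameter-free layer normalization over the last dimension
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

torch.manual_seed(0)
x = torch.randn(2, 5, 128)
full = layer_norm(x)

# Normalizing a single token alone gives the same vector as normalizing the whole batch
single = layer_norm(x[0, 3])
same_alone = torch.allclose(full[0, 3], single, atol=1e-5)

# Changing a different token leaves token (0, 3) untouched
x2 = x.clone()
x2[1, 0] += 100.0
unchanged = torch.allclose(layer_norm(x2)[0, 3], full[0, 3], atol=1e-5)

print(same_alone, unchanged)
```

Both checks are expected to pass. This is the property that separates layer norm from batch norm, whose statistics couple the samples in a batch.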

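You may wonder why the output needs zero mean and unit variance. The answer is that it does not have to. Most LayerNorm implementations compute the following: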

$$
y = \gamma \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
$$

where $\gamma$ and $\beta$ are learnable parameters applied independently to each element of the vector. Here is the modified implementation:

import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return x_norm * self.weight + self.bias

# Example usage
batch_size, seq_len, hidden_dim = 2, 5, 128
x = torch.randn(batch_size, seq_len, hidden_dim)
layer_norm = LayerNorm(hidden_dim)
output = layer_norm(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Output mean:\n{output.mean(axis=2)}")
print(f"Output std:\n{output.std(axis=2, correction=0)}")

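Since $\gamma$ and $\beta$ are applied to every vector, they must match the vector shape. You specify the vector length when creating the LayerNorm module, and the parameters are initialized to 1 and 0, respectively. During training, these parameters are adjusted to optimize the output for the next layer.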
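
To see what the two parameters do once they move away from their initial values, here is a small sketch with hand-picked values. The choices $\gamma = 2$ and $\beta = 0.5$ are hypothetical stand-ins for trained parameters, which in practice would differ per feature.

```python
import torch

torch.manual_seed(0)
batch_size, seq_len, hidden_dim = 2, 5, 128
x = torch.randn(batch_size, seq_len, hidden_dim)

# Normalize without learnable parameters
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
x_norm = (x - mean) / torch.sqrt(var + 1e-5)

# Hypothetical "trained" values, one entry per feature
gamma = torch.full((hidden_dim,), 2.0)
beta = torch.full((hidden_dim,), 0.5)

# Shapes (2, 5, 128) and (128,) broadcast over the batch and sequence dimensions
output = x_norm * gamma + beta

out_mean = output.mean(dim=-1)
out_std = output.std(dim=-1, correction=0)
print(out_mean)
print(out_std)
```

With a constant $\gamma$ and $\beta$, every position ends up with a mean near 0.5 and a standard deviation near 2, so the layer can undo or reshape the normalization if that helps the next layer.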

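Adaptive LayerNorm

The $\gamma$ and $\beta$ parameters in the previous section are learnable, but sometimes you want them to adapt to the input $x$ rather than use the same values for all inputs. Adaptive LayerNorm, introduced by Xu et al. in 2019, implements this idea. Although it is not common in language models, it is popular in other architectures such as diffusion models.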

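In formula form, adaptive layer normalization in the original paper is:

$$
\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y = C (1 - k\hat{x}) \odot \hat{x}
$$

where $C$ is a hyperparameter and $k$ is fixed at 0.1. The multiplication by $(1 - k\hat{x})$ is elementwise, so the scaling factor depends on the normalized input rather than on a learned $\gamma$. Other variants exist, but the core idea is to make the scale and shift parameters functions of the input data. A popular implementation uses linear layers to compute these parameters: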

import torch
import torch.nn as nn

class AdaptiveLayerNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.dim = dim
        self.eps = eps
        
        # Adaptive parameters
        self.ada_weight = nn.Linear(dim, dim)
        self.ada_bias = nn.Linear(dim, dim)
    
    def forward(self, x):
        # Standard LayerNorm
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        
        # Adaptive scaling and shifting
        ada_w = self.ada_weight(x)
        ada_b = self.ada_bias(x)
        
        return x_norm * ada_w + ada_b


# Example usage
batch_size, seq_len, hidden_dim = 2, 5, 8
x = torch.randn(batch_size, seq_len, hidden_dim)

ada_ln = AdaptiveLayerNorm(hidden_dim)
output = ada_ln(x)
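
In diffusion-style models, the scale and shift are often predicted from a separate conditioning vector (for example a timestep or class embedding) instead of from $x$ itself. The following is a sketch of that variant under stated assumptions: the class name `ConditionalLayerNorm`, the `cond_dim` argument, the `(1 + scale)` parameterization, and the zero initialization are illustrative choices, not something taken from this post.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalLayerNorm(nn.Module):
    """Hypothetical sketch: scale and shift are predicted from a conditioning vector."""
    def __init__(self, dim, cond_dim, eps=1e-5):
        super().__init__()
        # Parameter-free normalization; scale and shift come from `cond`
        self.norm = nn.LayerNorm(dim, eps=eps, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)
        # Zero init (an assumption) so the module starts out as a plain LayerNorm
        nn.init.zeros_(self.to_scale_shift.weight)
        nn.init.zeros_(self.to_scale_shift.bias)

    def forward(self, x, cond):
        # x: (batch, seq_len, dim), cond: (batch, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        # Broadcast the per-sample scale and shift over the sequence dimension
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

# Example usage
batch_size, seq_len, hidden_dim, cond_dim = 2, 5, 8, 16
x = torch.randn(batch_size, seq_len, hidden_dim)
cond = torch.randn(batch_size, cond_dim)

cond_ln = ConditionalLayerNorm(hidden_dim, cond_dim)
output = cond_ln(x, cond)
print(output.shape)  # torch.Size([2, 5, 8])
```

Here every position in a sequence shares the same scale and shift, and those values change from sample to sample with the conditioning vector.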

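RMS Norm and Its Implementation

Many recent transformer models, including the Llama family, use RMS Norm instead of LayerNorm. The key difference is that RMS Norm only rescales the input: it does not subtract the mean and has no shift term. Its mathematical formula is: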

$$\text{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2 + \epsilon}}$$

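where $x$ is a vector of dimension $d$. The denominator computes the root mean square of the vector's elements. The small quantity $\epsilon$ prevents division by zero, and $\gamma$ is a learnable vector used for elementwise multiplication.

Compared to LayerNorm, RMS Norm requires less computation and has a smaller memory footprint, because it skips the mean computation and has no bias vector. Here is an implementation: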

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.dim = dim
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
    
    def forward(self, x):
        # Compute the reciprocal of the RMS across the last dimension
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)

        # Normalize and apply the learnable scale
        x_norm = x * inv_rms * self.weight
        return x_norm

# Example usage
batch_size, seq_len, hidden_dim = 2, 5, 8
x = torch.randn(batch_size, seq_len, hidden_dim)
rms_norm = RMSNorm(hidden_dim)
output = rms_norm(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Output RMS: {torch.sqrt((output**2).mean(axis=2))}")

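RMS Norm may underperform LayerNorm in some cases because it does not center activations around zero. However, it can be less sensitive to outliers because it does not subtract the mean. Choosing between RMS Norm and LayerNorm is ultimately a design decision for the transformer model.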
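
The centering difference is easy to observe. The sketch below applies parameter-free versions of both normalizations to inputs with a deliberately nonzero mean; the helper names and the offset of 3.0 are arbitrary choices for illustration.

```python
import torch

def layer_norm(x, eps=1e-6):
    # Parameter-free LayerNorm over the last dimension
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

def rms_norm(x, eps=1e-6):
    # Parameter-free RMS Norm over the last dimension
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

torch.manual_seed(0)
x = torch.randn(4, 256) + 3.0  # activations with a clearly nonzero mean

ln_out = layer_norm(x)
rms_out = rms_norm(x)
print("LayerNorm output mean:", ln_out.mean(dim=-1))
print("RMS Norm output mean: ", rms_out.mean(dim=-1))
print("RMS Norm output RMS:  ", rms_out.pow(2).mean(dim=-1).sqrt())

# On inputs that already have zero mean, the two normalizations coincide
x_centered = x - x.mean(dim=-1, keepdim=True)
same_when_centered = torch.allclose(layer_norm(x_centered), rms_norm(x_centered), atol=1e-5)
print(same_when_centered)
```

LayerNorm removes the offset, while RMS Norm keeps it and only fixes the root mean square at 1. When the input is already centered, the mean of squares equals the biased variance, so both produce the same result.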

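Using PyTorch's Built-in Normalization

Although it is valuable to understand how to implement normalization from scratch, in practice you should use PyTorch's built-in modules for better performance.

PyTorch's LayerNorm has both scale and shift parameters, while RMSNorm has only a scale parameter. Note that nn.RMSNorm requires PyTorch 2.4 or later. Here is how to use them: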

import torch
import torch.nn as nn

# PyTorch's LayerNorm
batch_size, seq_len, hidden_dim = 2, 5, 8
x = torch.randn(batch_size, seq_len, hidden_dim)

# LayerNorm normalizes over the last dimension
layer_norm = nn.LayerNorm(hidden_dim)
output_ln = layer_norm(x)

# RMSNorm normalizes over the last dimension
rms_norm = nn.RMSNorm(hidden_dim)
output_rms = rms_norm(x)

You can verify that each module has learnable parameters:

...
print(layer_norm.weight) # nn.Parameter
print(layer_norm.bias) # nn.Parameter
print(rms_norm.weight) # nn.Parameter
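
As a sanity check, the built-in modules can be compared against the formulas from the earlier sections. This is a minimal sketch: it passes `eps=1e-6` to `nn.RMSNorm` explicitly so that the manual formula uses the same value, and it skips the RMS Norm check when `nn.RMSNorm` is not available in the installed PyTorch version.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim = 8
x = torch.randn(2, 5, hidden_dim)

# Built-in LayerNorm versus the manual formula (weight starts at 1, bias at 0)
layer_norm = nn.LayerNorm(hidden_dim)
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual_ln = (x - mean) / torch.sqrt(var + layer_norm.eps)
ln_matches = torch.allclose(layer_norm(x), manual_ln, atol=1e-6)
ln_params = sum(p.numel() for p in layer_norm.parameters())
print(ln_matches, ln_params)

# Built-in RMSNorm versus the manual formula, if this PyTorch version provides it
if hasattr(nn, "RMSNorm"):
    rms_norm = nn.RMSNorm(hidden_dim, eps=1e-6)
    manual_rms = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + 1e-6)
    rms_matches = torch.allclose(rms_norm(x), manual_rms, atol=1e-6)
    rms_params = sum(p.numel() for p in rms_norm.parameters())
    print(rms_matches, rms_params)
```

The parameter counts reflect the difference described above: LayerNorm holds a weight and a bias vector (2 x `hidden_dim` values), while RMSNorm holds only a weight vector.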

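Summary

In this post, you learned about normalization techniques in transformer models. Specifically, you learned:

Why normalization is necessary for training stability in deep networks
How LayerNorm and RMS Norm work, along with their mathematical formulas
How to implement these normalization techniques from scratch
How to use PyTorch's built-in normalization layers

Normalization is a fundamental component that makes training deep transformer models possible. Understanding these techniques helps you design more stable and efficient architectures.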
