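Quantization¶

Warning

Quantization is in beta and subject to change.

Introduction to Quantization¶

Quantization refers to techniques for performing computations and storing tensors at lower bitwidths than floating point precision. A quantized model executes some or all of its tensor operations with reduced precision rather than full precision (floating point) values. This allows a more compact model representation and the use of high performance vectorized operations on many hardware platforms. PyTorch supports INT8 quantization, which compared to a typical FP32 model allows a 4x reduction in model size and a 4x reduction in memory bandwidth requirements. Hardware support for INT8 computation is typically 2 to 4 times faster than FP32 computation. Quantization is primarily a technique to speed up inference, and only the forward pass is supported for quantized operators.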
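The 4x figure follows directly from the bit widths. The short sketch below works through the arithmetic in plain Python; the parameter count is an arbitrary illustrative value, and the estimate ignores the small overhead of storing quantization parameters.

```python
def tensor_bytes(num_elements: int, bits_per_element: int) -> int:
    """Bytes needed to store num_elements values at the given bit width."""
    return num_elements * bits_per_element // 8


# Hypothetical parameter count, chosen only for illustration.
num_weights = 1_000_000

fp32_bytes = tensor_bytes(num_weights, 32)  # FP32: 4 bytes per value
int8_bytes = tensor_bytes(num_weights, 8)   # INT8: 1 byte per value
ratio = fp32_bytes / int8_bytes

print(f"FP32: {fp32_bytes} bytes, INT8: {int8_bytes} bytes, ratio: {ratio}x")
```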

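PyTorch supports multiple approaches to quantizing a deep learning model. In most cases the model is trained in FP32 and then converted to INT8. In addition, PyTorch supports quantization aware training, which models quantization errors in both the forward and backward passes using fake-quantization modules. Note that the entire computation is carried out in floating point. At the end of quantization aware training, PyTorch provides conversion functions to convert the trained model into lower precision.

At a lower level, PyTorch provides a way to represent quantized tensors and perform operations with them. They can be used to directly construct models that perform all or part of the computation in lower precision. Higher-level APIs are provided that incorporate typical workflows for converting an FP32 model to lower precision with minimal accuracy loss.

Quantization API Summary¶

PyTorch provides three different modes of quantization: Eager Mode Quantization, FX Graph Mode Quantization (maintenance) and PyTorch 2 Export Quantization.

Eager Mode Quantization is a beta feature. Users need to do fusion and specify where quantization and dequantization happen manually, and it only supports modules, not functionals.

FX Graph Mode Quantization is an automated quantization workflow in PyTorch. It is currently a prototype feature and has been in maintenance mode since PyTorch 2 Export Quantization became available. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although people might need to refactor the model to make it compatible with FX Graph Mode Quantization (symbolically traceable with torch.fx). Note that FX Graph Mode Quantization is not expected to work on arbitrary models, since a model might not be symbolically traceable. We will integrate it into domain libraries like torchvision, and users will be able to quantize models similar to the ones in supported domain libraries with FX Graph Mode Quantization. For arbitrary models we will provide general guidelines, but to actually make it work, users might need to be familiar with torch.fx, especially on how to make a model symbolically traceable.

PyTorch 2 Export Quantization is the new full graph mode quantization workflow, released as a prototype feature in PyTorch 2.1. With PyTorch 2, we are moving to a better solution for full program capture (torch.export), since it can capture a higher percentage of models (88.8% on 14K models) compared to torch.fx.symbolic_trace (72.7% on 14K models), the program capture solution used by FX Graph Mode Quantization. torch.export still has limitations around some Python constructs and requires user involvement to support dynamism in the exported model, but overall it is an improvement over the previous program capture solution. PyTorch 2 Export Quantization is built for models captured by torch.export, with the flexibility and productivity of both modeling users and backend developers in mind. The main features are: (1) a programmable API for configuring how a model is quantized that can scale to many more use cases; (2) a simplified UX for modeling users and backend developers, since they only need to interact with a single object (Quantizer) to express the user's intention about how to quantize a model and what the backend supports; (3) an optional reference quantized model representation that can represent quantized computation with integer operations, mapping closer to the actual quantized computations that happen in hardware.

New users of quantization are encouraged to try out PyTorch 2 Export Quantization first; if it does not work well, they can try Eager Mode Quantization.

The following table compares the differences between Eager Mode Quantization, FX Graph Mode Quantization and PyTorch 2 Export Quantization: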

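	Eager Mode Quantization	FX Graph Mode Quantization	PyTorch 2 Export Quantization
Release Status	beta	prototype (maintenance)	prototype
Operator Fusion	Manual	Automatic	Automatic
Quant/DeQuant Placement	Manual	Automatic	Automatic
Quantizing Modules	Supported	Supported	Supported
Quantizing Functionals/Torch Ops	Manual	Automatic	Supported
Support for Customization	Limited Support	Fully Supported	Fully Supported
Quantization Mode Support	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static	Defined by backend-specific Quantizer
Input/Output Model Type	`torch.nn.Module`	`torch.nn.Module` (may need some refactors to make the model compatible with FX Graph Mode Quantization)	`torch.fx.GraphModule` (captured by `torch.export`)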

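There are three types of quantization supported:

dynamic quantization (weights quantized, with activations read/stored in floating point and quantized for compute)
static quantization (weights quantized, activations quantized, calibration required post training)
static quantization aware training (weights quantized, activations quantized, quantization numerics modeled during training)

Please see our Introduction to Quantization on PyTorch blog post for a more comprehensive overview of the tradeoffs between these quantization types.

Operator coverage varies between dynamic and static quantization and is captured in the table below.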

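	Static Quantization	Dynamic Quantization
nn.Linear nn.Conv1d/2d/3d	Y Y	Y N
nn.LSTM nn.GRU	Y (through custom modules) N	Y Y
nn.RNNCell nn.GRUCell nn.LSTMCell	N N N	Y Y Y
nn.EmbeddingBag	Y (activations are in fp32)	Y
nn.Embedding	Y	Y
nn.MultiheadAttention	Y (through custom modules)	Not supported
Activations	Broadly supported	Unchanged, computations stay in fp32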

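Eager Mode Quantization¶

For a general introduction to the quantization flow, including different types of quantization, please see General Quantization Flow.

Post Training Dynamic Quantization¶

This is the simplest form of quantization to apply: the weights are quantized ahead of time, but the activations are dynamically quantized during inference. This is used for situations where the model execution time is dominated by loading weights from memory rather than computing the matrix multiplications. This is true for LSTM and Transformer type models with small batch size.

Diagram: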

# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                 /
linear_weight_fp32

# dynamically quantized model
# linear and LSTM weights are in int8
previous_layer_fp32 -- linear_int8_w_fp32_inp -- activation_fp32 -- next_layer_fp32
                     /
   linear_weight_int8

PTDQ API Example:

import torch

# define a floating point model
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 4)

    def forward(self, x):
        x = self.fc(x)
        return x

# create a model instance
model_fp32 = M()
# create a quantized model instance
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32,  # the original model
    {torch.nn.Linear},  # a set of layers to dynamically quantize
    dtype=torch.qint8)  # the target dtype for quantized weights

# run the model
input_fp32 = torch.randn(4, 4, 4, 4)
res = model_int8(input_fp32)

To learn more about dynamic quantization, please see our dynamic quantization tutorial.
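To make the split between "ahead of time" and "during inference" concrete, here is a minimal plain-Python sketch of the idea. It is not how torch.ao.quantization.quantize_dynamic is implemented; the class DynamicLinearSketch, the symmetric int8 scheme and the single-output layer are all assumptions chosen to keep the arithmetic readable.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in values onto qmax."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, qmin=-127, qmax=127):
    """Round each value to the integer grid and clamp to [qmin, qmax]."""
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


class DynamicLinearSketch:
    """Hypothetical one-output linear layer without bias.

    Weights are quantized once at construction time; the activation scale
    is recomputed from every input, which is the 'dynamic' part.
    """

    def __init__(self, weights_fp32):
        self.w_scale = symmetric_scale(weights_fp32)
        self.w_int8 = quantize(weights_fp32, self.w_scale)  # done ahead of time

    def __call__(self, x_fp32):
        x_scale = symmetric_scale(x_fp32)      # derived from this input only
        x_int8 = quantize(x_fp32, x_scale)
        acc = sum(w * x for w, x in zip(self.w_int8, x_int8))  # integer math
        return acc * self.w_scale * x_scale    # rescale the result to float


weights = [0.4, -0.25, 1.0]
layer = DynamicLinearSketch(weights)
x = [1.0, 4.0, -1.0]

y_quant = layer(x)
y_ref = sum(w * v for w, v in zip(weights, x))  # float reference
print(y_quant, y_ref)
```

The output is returned in floating point, matching the diagram above in which the layers around the dynamically quantized linear stay in fp32.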

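Post Training Static Quantization¶

Post Training Static Quantization (PTQ static) quantizes the weights and activations of the model. It fuses activations into preceding layers where possible. It requires calibration with a representative dataset to determine optimal quantization parameters for activations. Post Training Static Quantization is typically used when both memory bandwidth and compute savings are important, with CNNs being a typical use case.

We may need to modify the model before applying post training static quantization. Please see Model Preparation for Eager Mode Static Quantization.

Diagram: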

# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                    /
    linear_weight_fp32

# statically quantized model
# weights and activations are in int8
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                    /
  linear_weight_int8

PTSQ API Example:

import torch

# define a floating point model where some layers could be statically quantized
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        # manually specify where tensors will be converted from floating
        # point to quantized in the quantized model
        x = self.quant(x)
        x = self.conv(x)
        x = self.relu(x)
        # manually specify where tensors will be converted from quantized
        # to floating point in the quantized model
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval mode for static quantization logic to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('x86')

# Fuse the activations to preceding layers, where applicable.
# This needs to be done manually depending on the model architecture.
# Common fusions include `conv + relu` and `conv + batchnorm + relu`
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'relu']])

# Prepare the model for static quantization. This inserts observers in
# the model that will observe activation tensors during calibration.
model_fp32_prepared = torch.ao.quantization.prepare(model_fp32_fused)

# calibrate the prepared model to determine quantization parameters for activations
# in a real world setting, the calibration would be done with a representative dataset
input_fp32 = torch.randn(4, 1, 4, 4)
model_fp32_prepared(input_fp32)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and zero point to be
# used with each activation tensor, and replaces key operators with quantized
# implementations.
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)

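To learn more about static quantization, please see the static quantization tutorial.

Quantization Aware Training for Static Quantization¶

Quantization Aware Training (QAT) models the effects of quantization during training, allowing for higher accuracy compared to other quantization methods. We can do QAT for static, dynamic or weight only quantization. During training, all calculations are done in floating point, with fake_quant modules modeling the effects of quantization by clamping and rounding to simulate the effects of INT8. After model conversion, weights and activations are quantized, and activations are fused into the preceding layer where possible. It is commonly used with CNNs and yields a higher accuracy compared to static quantization.

We may need to modify the model before applying quantization aware training. Please see Model Preparation for Eager Mode Static Quantization.

Diagram: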

# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                      /
    linear_weight_fp32

# model with fake_quants for modeling quantization numerics during training
previous_layer_fp32 -- fq -- linear_fp32 -- activation_fp32 -- fq -- next_layer_fp32
                           /
   linear_weight_fp32 -- fq

# quantized model
# weights and activations are in int8
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                     /
   linear_weight_int8

QAT API Example:

import torch

# define a floating point model where some layers could benefit from QAT
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.bn = torch.nn.BatchNorm2d(1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval for fusion to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')

# fuse the activations to preceding layers, where applicable
# this needs to be done manually depending on the model architecture
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32,
    [['conv', 'bn', 'relu']])

# Prepare the model for QAT. This inserts observers and fake_quants in
# the model that will observe weight and activation tensors during calibration.
# The model needs to be set to train mode for QAT logic to work.
model_fp32_prepared = torch.ao.quantization.prepare_qat(model_fp32_fused.train())

# run the training loop (not shown); training_loop is a user-supplied function
training_loop(model_fp32_prepared)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and zero point to be
# used with each activation tensor, fuses modules where appropriate,
# and replaces key operators with quantized implementations.
model_fp32_prepared.eval()
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
input_fp32 = torch.randn(4, 1, 4, 4)
res = model_int8(input_fp32)

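To learn more about quantization aware training, please see the QAT tutorial.

Model Preparation for Eager Mode Static Quantization¶

It is currently necessary to make some modifications to the model definition prior to Eager mode quantization. This is because quantization currently works on a module by module basis. Specifically, for all quantization techniques, the user needs to:

Convert any operations that require output requantization (and thus have additional parameters) from functionals to module form (for example, using torch.nn.ReLU instead of torch.nn.functional.relu).
Specify which parts of the model need to be quantized, either by assigning .qconfig attributes on submodules or by specifying qconfig_mapping. For example, setting model.conv1.qconfig = None means that the model.conv1 layer will not be quantized, and setting model.linear1.qconfig = custom_qconfig means that the quantization settings for model.linear1 will use custom_qconfig instead of the global qconfig.

For static quantization techniques which quantize activations, the user needs to do the following in addition:

Specify where activations are quantized and de-quantized. This is done using QuantStub and DeQuantStub modules.
Use FloatFunctional to wrap tensor operations that require special handling for quantization into modules. Examples are operations like add and cat, which require special handling to determine output quantization parameters.
Fuse modules: combine operations/modules into a single module to obtain higher accuracy and performance. This is done using the fuse_modules() API, which takes in lists of modules to be fused. We currently support the following fusions: [Conv, Relu], [Conv, BatchNorm], [Conv, BatchNorm, Relu], [Linear, Relu]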
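Why a [Conv, BatchNorm, Relu] group can collapse into one module is easiest to see in the arithmetic. The sketch below uses the standard eval-mode batch norm folding identity on a scalar 1x1 "convolution"; it illustrates the math only and is not the code path inside fuse_modules(). All function names and numeric values are made up for the example.

```python
import math


def conv1x1(x, w, b):
    """Scalar stand-in for a 1x1 convolution with one input and one output channel."""
    return w * x + b


def batchnorm_eval(y, gamma, beta, mean, var, eps=1e-5):
    """Batch norm in eval mode, using fixed running statistics."""
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch norm parameters into the conv weight and bias."""
    factor = gamma / math.sqrt(var + eps)
    return w * factor, (b - mean) * factor + beta


# Arbitrary illustrative parameters.
w, b = 2.0, 0.5
gamma, beta, mean, var = 1.5, -0.2, 0.3, 4.0
w_fused, b_fused = fuse_conv_bn(w, b, gamma, beta, mean, var)

x = 3.0
unfused = max(0.0, batchnorm_eval(conv1x1(x, w, b), gamma, beta, mean, var))
fused = max(0.0, conv1x1(x, w_fused, b_fused))  # one op followed by ReLU
print(unfused, fused)
```

Because the fused form produces a single output, only one set of activation quantization parameters is needed for the whole group instead of one per intermediate tensor.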

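(Prototype - maintenance mode) FX Graph Mode Quantization¶

There are multiple quantization types in post training quantization (weight only, dynamic and static), and the configuration is done through qconfig_mapping (an argument of the prepare_fx function).

FXPTQ API Example: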

import torch
from torch.ao.quantization import (
  get_default_qconfig_mapping,
  get_default_qat_qconfig_mapping,
  QConfigMapping,
)
import torch.ao.quantization.quantize_fx as quantize_fx
import copy

model_fp = UserModel()

#
# post training dynamic/weight_only quantization
#

# we need to deepcopy if we still want to keep model_fp unchanged after quantization since quantization apis change the input model
model_to_quantize = copy.deepcopy(model_fp)
model_to_quantize.eval()
qconfig_mapping = QConfigMapping().set_global(torch.ao.quantization.default_dynamic_qconfig)
# a tuple of one or more example inputs is needed to trace the model
example_inputs = (input_fp32,)
# prepare
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)
# no calibration needed when we only have dynamic/weight_only quantization
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# post training static quantization
#

model_to_quantize = copy.deepcopy(model_fp)
qconfig_mapping = get_default_qconfig_mapping("qnnpack")
model_to_quantize.eval()
# prepare
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)
# calibrate (not shown)
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# quantization aware training for static quantization
#

model_to_quantize = copy.deepcopy(model_fp)
qconfig_mapping = get_default_qat_qconfig_mapping("qnnpack")
model_to_quantize.train()
# prepare
model_prepared = quantize_fx.prepare_qat_fx(model_to_quantize, qconfig_mapping, example_inputs)
# training loop (not shown)
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# fusion
#
model_to_quantize = copy.deepcopy(model_fp)
model_fused = quantize_fx.fuse_fx(model_to_quantize)

Please follow the tutorials below to learn more about FX Graph Mode Quantization:

(Prototype) PyTorch 2 Export Quantization¶

API Example:

import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 10)

    def forward(self, x):
        return self.linear(x)

# initialize a floating point model
float_model = M().eval()

# define calibration function
def calibrate(model, data_loader):
    model.eval()
    with torch.no_grad():
        for image, target in data_loader:
            model(image)

# Step 1. program capture
# NOTE: this API will be updated to torch.export API in the future, but the captured
# result should mostly stay the same
example_inputs = (torch.randn(1, 5),)  # a tuple of example inputs matching M.forward
m = capture_pre_autograd_graph(float_model, example_inputs)
# we get a model with aten ops

# Step 2. quantization
# backend developer will write their own Quantizer and expose methods to allow
# users to express how they
# want the model to be quantized
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
# or prepare_qat_pt2e for Quantization Aware Training
m = prepare_pt2e(m, quantizer)

# run calibration
# calibrate(m, sample_inference_data)
m = convert_pt2e(m)

# Step 3. lowering
# lower to target backend

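Please follow these tutorials to get started on PyTorch 2 Export Quantization:

Modeling Users:

Backend Developers (please check out all Modeling Users docs as well):

How to Write a Quantizer for PyTorch 2 Export Quantization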

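Quantization Stack¶

Quantization is the process of converting a floating point model to a quantized model. So at a high level the quantization stack can be split into two parts: (1) the building blocks or abstractions for a quantized model; (2) the building blocks or abstractions for the quantization flow that converts a floating point model to a quantized model.

Quantized Model¶

Quantized Tensor¶

In order to do quantization in PyTorch, we need to be able to represent quantized data in Tensors. A Quantized Tensor allows for storing quantized data (represented as int8/uint8/int32) along with quantization parameters like scale and zero_point. Quantized Tensors allow for many useful operations making quantized arithmetic easy, in addition to allowing for serialization of data in a quantized format.

PyTorch supports both per tensor and per channel symmetric and affine quantization. Per tensor means that all the values within the tensor are quantized the same way with the same quantization parameters. Per channel means that each slice along one dimension, typically the channel dimension of a tensor, is quantized with its own quantization parameters. This allows for less error in converting tensors to quantized values, since an outlier value only impacts the channel it is in instead of the entire Tensor.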
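The effect of an outlier on the two schemes can be checked with a small plain-Python experiment. This is a sketch under simplifying assumptions (symmetric int8 quantization, a two-channel weight stored as nested lists); the helper names are hypothetical and no PyTorch calls are involved.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in values onto qmax."""
    return max(abs(v) for v in values) / qmax


def quantize_dequantize(values, scale, qmin=-127, qmax=127):
    """Quantize to the integer grid, then map back to floats."""
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in values]


def max_error(original, approx):
    return max(abs(a - b) for a, b in zip(original, approx))


weights = [
    [0.01, -0.02, 0.03],  # channel 0: small values only
    [5.0, -4.0, 100.0],   # channel 1: contains a large outlier
]

# Per tensor: one scale for everything, dominated by the outlier.
tensor_scale = symmetric_scale([v for channel in weights for v in channel])
per_tensor_err_ch0 = max_error(
    weights[0], quantize_dequantize(weights[0], tensor_scale))

# Per channel: each channel gets a scale fitted to its own range.
channel_scales = [symmetric_scale(channel) for channel in weights]
per_channel_err_ch0 = max_error(
    weights[0], quantize_dequantize(weights[0], channel_scales[0]))

print(per_tensor_err_ch0, per_channel_err_ch0)
```

With a single per tensor scale, every value in channel 0 rounds to zero, so the error equals the largest weight in that channel; with per channel scales the error is bounded by half of channel 0's own step size.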

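The mapping is performed by converting the floating point tensors using

$Q(x, \text{scale}, \text{zero\_point}) = \text{round}\left(\frac{x}{\text{scale}} + \text{zero\_point}\right)$

Note that we ensure that zero in floating point is represented with no error after quantization, thereby ensuring that operations like padding do not cause additional quantization error.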
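The mapping and the exact-zero property can be sketched in a few lines of plain Python. The clamp to [qmin, qmax], the quint8 range and the chosen scale and zero_point are assumptions for this illustration; Python's round is used as a stand-in for the rounding done by the real kernels.

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Affine mapping round(x / scale + zero_point), clamped to the integer range."""
    q = round(x / scale + zero_point)
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Inverse mapping back to floating point."""
    return (q - zero_point) * scale


# Illustrative parameters for a quint8-style range [0, 255].
scale, zero_point = 0.1, 128

values = [-1.0, 0.0, 0.26, 3.0]
q_values = [quantize(v, scale, zero_point) for v in values]
recovered = [dequantize(q, scale, zero_point) for q in q_values]

# Floating point zero maps to zero_point and comes back as exactly 0.0.
zero_roundtrip = dequantize(quantize(0.0, scale, zero_point), scale, zero_point)
print(q_values, recovered, zero_roundtrip)
```

Values inside the representable range come back within half a step (scale / 2) of the original, while values outside it saturate at qmin or qmax.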

Here are a few key attributes of a quantized Tensor:

QScheme (torch.qscheme): an enum that specifies the way we quantize the Tensor
- torch.per_tensor_affine
- torch.per_tensor_symmetric
- torch.per_channel_affine
- torch.per_channel_symmetric
dtype (torch.dtype): data type of the quantized Tensor
- torch.quint8
- torch.qint8
- torch.qint32
- torch.float16
quantization parameters (vary based on QScheme): parameters for the chosen way of quantization
- torch.per_tensor_affine has quantization parameters of
  - scale (float)
  - zero_point (int)
- torch.per_channel_affine has quantization parameters of
  - per_channel_scales (list of float)
  - per_channel_zero_points (list of int)
  - axis (int)

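Quantize and Dequantize¶

The input and output of a model are floating point Tensors, but activations in the quantized model are quantized, so we need operators to convert between floating point and quantized Tensors.

Quantize (float -> quantized)
- torch.quantize_per_tensor(x, scale, zero_point, dtype)
- torch.quantize_per_channel(x, scales, zero_points, axis, dtype)
- torch.quantize_per_tensor_dynamic(x, dtype, reduce_range)
- to(torch.float16)
Dequantize (quantized -> float)
- quantized_tensor.dequantize() - calling dequantize on a torch.float16 Tensor will convert the Tensor back to torch.float
- torch.dequantize(x)

Quantized Operators/Modules¶

A Quantized Operator is an operator that takes quantized Tensors as inputs and outputs a quantized Tensor.
Quantized Modules are PyTorch Modules that perform quantized operations. They are typically defined for weighted operations like linear and conv.

Quantized Engine¶

When a quantized model is executed, the qengine (torch.backends.quantized.engine) specifies which backend is to be used for execution. It is important to ensure that the qengine is compatible with the quantized model in terms of the value range of quantized activations and weights.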

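Quantization Flow¶

Observer and FakeQuantize¶

Observers are PyTorch Modules used to:
- collect tensor statistics, such as the min value and max value of the Tensor passing through the observer
- calculate quantization parameters based on the collected tensor statistics
FakeQuantize modules are PyTorch Modules used to:
- simulate quantization (performing quantize/dequantize) for a Tensor in the network
- calculate quantization parameters based on the statistics collected by an observer, or learn the quantization parameters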
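The two roles can be illustrated with a plain-Python sketch. MinMaxObserverSketch and fake_quantize below are hypothetical stand-ins written for this page, not the classes in torch.ao.quantization; they assume a quint8-style range and a simple min/max rule for deriving the parameters.

```python
class MinMaxObserverSketch:
    """Tracks the running min/max of observed values and derives affine qparams."""

    def __init__(self, qmin=0, qmax=255):
        self.qmin, self.qmax = qmin, qmax
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, values):
        # Collect statistics from one batch of values.
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))

    def calculate_qparams(self):
        # Extend the range to include 0.0 so that zero stays exactly representable.
        lo = min(self.min_val, 0.0)
        hi = max(self.max_val, 0.0)
        scale = (hi - lo) / (self.qmax - self.qmin)
        if scale == 0.0:
            scale = 1.0
        zero_point = self.qmin - round(lo / scale)
        zero_point = max(self.qmin, min(self.qmax, zero_point))
        return scale, zero_point


def fake_quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize then immediately dequantize, so the result is still a float."""
    q = max(qmin, min(qmax, round(x / scale + zero_point)))
    return (q - zero_point) * scale


observer = MinMaxObserverSketch()
observer.observe([-1.0, 0.5, 2.0])  # first calibration batch
observer.observe([0.25, 3.0])       # second calibration batch
scale, zero_point = observer.calculate_qparams()

print(scale, zero_point, fake_quantize(1.0, scale, zero_point))
```

The fake-quantized value stays in floating point but lands on the grid defined by scale and zero_point, which is how quantization error becomes visible during training or calibration.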

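QConfig¶

QConfig is a namedtuple of Observer or FakeQuantize module classes that can be configured with qscheme, dtype, etc. It is used to configure how an operator should be observed.
- Quantization configurations for ops/modules
  - different types of Observer/FakeQuantize
  - dtype
  - qscheme
  - quant_min/quant_max: can be used to simulate lower precision Tensors
- Currently supports configurations for activation and weight
- We insert input/weight/output observers based on the qconfig configured for a given operator or module

General Quantization Flow¶

In general, the flow is the following:

prepare
- insert Observer/FakeQuantize modules based on the user-specified qconfig
calibrate/train (depending on post training quantization or quantization aware training)
- allow Observers to collect statistics or FakeQuantize modules to learn the quantization parameters
convert
- convert a calibrated/trained model to a quantized model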
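The three phases can be mimicked on a toy scalar "module" in plain Python. Everything here (FloatScale, ObservedScale, QuantizedScale and the free functions prepare, calibrate and convert) is a hypothetical stand-in meant to show the order of operations, not the behavior of the PyTorch functions with similar names.

```python
class FloatScale:
    """Float 'module' computing y = w * x for a scalar weight."""

    def __init__(self, w):
        self.w = w

    def __call__(self, x):
        return self.w * x


class ObservedScale:
    """Result of prepare: same math, plus an observer on the output."""

    def __init__(self, float_module):
        self.float_module = float_module
        self.max_abs_out = 0.0

    def __call__(self, x):
        y = self.float_module(x)
        self.max_abs_out = max(self.max_abs_out, abs(y))  # collect statistics
        return y


class QuantizedScale:
    """Result of convert: outputs snap to an int8 grid with a fixed scale."""

    def __init__(self, observed, qmax=127):
        self.w = observed.float_module.w
        self.qmax = qmax
        # Requires at least one non-zero calibration output.
        self.out_scale = observed.max_abs_out / qmax

    def __call__(self, x):
        q = round(self.w * x / self.out_scale)
        q = max(-self.qmax, min(self.qmax, q))
        return q * self.out_scale


def prepare(module):
    return ObservedScale(module)


def calibrate(module, samples):
    for x in samples:
        module(x)


def convert(module):
    return QuantizedScale(module)


model = FloatScale(0.5)
prepared = prepare(model)                 # 1. insert observer
calibrate(prepared, [1.0, -3.0, 2.0])     # 2. collect statistics
quantized = convert(prepared)             # 3. freeze quantization parameters
result = quantized(2.0)
print(prepared.max_abs_out, quantized.out_scale, result)
```

Inputs that produce outputs beyond the calibrated range saturate after conversion, which is why calibration data should be representative of real inputs.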

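There are different modes of quantization, and they can be classified in two ways:

In terms of where we apply the quantization flow, we have:

Post Training Quantization (apply quantization after training; quantization parameters are calculated based on sample calibration data)
Quantization Aware Training (simulate quantization during training so that the quantization parameters can be learned together with the model using training data)

In terms of how we quantize the operators, we can have:

Weight Only Quantization (only the weight is statically quantized)
Dynamic Quantization (the weight is statically quantized, the activation is dynamically quantized)
Static Quantization (both weight and activations are statically quantized)

We can mix different ways of quantizing operators in the same quantization flow. For example, we can have post training quantization that has both statically and dynamically quantized operators.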

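Quantization Support Matrix¶

Quantization Mode Support¶

	Quantization Mode		Dataset Requirement	Works Best For	Accuracy	Notes
Post Training Quantization	Dynamic/Weight Only Quantization	activation dynamically quantized (fp16, int8) or not quantized, weight statically quantized (fp16, int8, int4)	None	LSTM, MLP, Embedding, Transformer	good	Easy to use, close to static quantization when performance is compute or memory bound due to weights
Post Training Quantization	Static Quantization	activation and weights statically quantized (int8)	calibration dataset	CNN	good	Provides best performance, may have a big impact on accuracy, good for hardware that only supports int8 computation
Quantization Aware Training	Dynamic Quantization	activation and weight are fake quantized	fine-tuning dataset	MLP, Embedding	best	Limited support for now
Quantization Aware Training	Static Quantization	activation and weight are fake quantized	fine-tuning dataset	CNN, MLP, Embedding	best	Typically used when static quantization leads to bad accuracy, and used to close the accuracy gap

Please see our Introduction to Quantization on PyTorch blog post for a more comprehensive overview of the tradeoffs between these quantization types.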

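Quantization Flow Support¶

PyTorch provides two modes of quantization: Eager Mode Quantization and FX Graph Mode Quantization.

Eager Mode Quantization is a beta feature. Users need to do fusion and specify where quantization and dequantization happen manually, and it only supports modules, not functionals.

FX Graph Mode Quantization is an automated quantization framework in PyTorch, and currently it is a prototype feature. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although people might need to refactor the model to make it compatible with FX Graph Mode Quantization (symbolically traceable with torch.fx). Note that FX Graph Mode Quantization is not expected to work on arbitrary models, since a model might not be symbolically traceable. We will integrate it into domain libraries like torchvision, and users will be able to quantize models similar to the ones in supported domain libraries with FX Graph Mode Quantization. For arbitrary models we will provide general guidelines, but to actually make it work, users might need to be familiar with torch.fx, especially on how to make a model symbolically traceable.

New users of quantization are encouraged to try out FX Graph Mode Quantization first; if it does not work, they may try to follow the guidelines of Using FX Graph Mode Quantization or fall back to Eager Mode Quantization.

The following table compares the differences between Eager Mode Quantization and FX Graph Mode Quantization:

	Eager Mode Quantization	FX Graph Mode Quantization
Release Status	beta	prototype
Operator Fusion	Manual	Automatic
Quant/DeQuant Placement	Manual	Automatic
Quantizing Modules	Supported	Supported
Quantizing Functionals/Torch Ops	Manual	Automatic
Support for Customization	Limited Support	Fully Supported
Quantization Mode Support	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static
Input/Output Model Type	`torch.nn.Module`	`torch.nn.Module` (may need some refactors to make the model compatible with FX Graph Mode Quantization)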

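Backend/Hardware Support¶

Hardware	Kernel Library	Eager Mode Quantization	FX Graph Mode Quantization	Quantization Mode Support
server CPU	fbgemm/onednn	Supported		All Supported
mobile CPU	qnnpack/xnnpack	Supported		All Supported
server GPU	TensorRT (early prototype)	Not supported, since it requires a graph	Supported	Static Quantization

Today, PyTorch supports the following backends for running quantized operators efficiently:

x86 CPUs with AVX2 support or higher (without AVX2 some operations have inefficient implementations), via x86 optimized by fbgemm and onednn (see the details in the RFC)
ARM CPUs (typically found in mobile/embedded devices), via qnnpack
(early prototype) support for NVidia GPU via TensorRT through fx2trt (to be open sourced)

Note for native CPU backends¶

We expose both x86 and qnnpack with the same native PyTorch quantized operators, so we need an additional flag to distinguish between them. The corresponding implementation of x86 and qnnpack is chosen automatically based on the PyTorch build mode, though users have the option to override this by setting torch.backends.quantized.engine to x86 or qnnpack.

When preparing a quantized model, it is necessary to ensure that the qconfig and the engine used for quantized computations match the backend on which the model will be executed. The qconfig controls the type of observers used during the quantization passes. The qengine controls whether the x86 or qnnpack specific packing function is used when packing weights for linear and convolution functions and modules. For example:

Default settings for x86: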

# set the qconfig for PTQ
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default on x86 CPUs
qconfig = torch.ao.quantization.get_default_qconfig('x86')
# or, set the qconfig for QAT
qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')
# set the qengine to control weight packing
torch.backends.quantized.engine = 'x86'

Default settings for qnnpack:

# set the qconfig for PTQ
qconfig = torch.ao.quantization.get_default_qconfig('qnnpack')
# or, set the qconfig for QAT
qconfig = torch.ao.quantization.get_default_qat_qconfig('qnnpack')
# set the qengine to control weight packing
torch.backends.quantized.engine = 'qnnpack'

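Operator Support¶

Operator coverage varies between dynamic and static quantization and is captured in the table below. Note that for FX Graph Mode Quantization, the corresponding functionals are also supported.

	Static Quantization	Dynamic Quantization
nn.Linear nn.Conv1d/2d/3d	Y Y	Y N
nn.LSTM nn.GRU	N N	Y Y
nn.RNNCell nn.GRUCell nn.LSTMCell	N N N	Y Y Y
nn.EmbeddingBag	Y (activations are in fp32)	Y
nn.Embedding	Y	Y
nn.MultiheadAttention	Not supported	Not supported
Activations	Broadly supported	Unchanged, computations stay in fp32

Note: this will be updated with some information generated from the native backend_config_dict soon.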

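Quantization API Reference¶

The Quantization API Reference contains documentation of quantization APIs, such as quantization passes, quantized tensor operations, and supported quantized modules and functions.

Quantization Backend Configuration¶

The Quantization Backend Configuration contains documentation on how to configure the quantization workflows for various backends.

Quantization Accuracy Debugging¶

The Quantization Accuracy Debugging contains documentation on how to debug quantization accuracy.

Quantization Customizations¶

While default implementations of observers to select the scale factor and bias based on observed tensor data are provided, developers can provide their own quantization functions. Quantization can be applied selectively to different parts of the model or configured differently for different parts of the model.

We also provide support for per channel quantization for conv1d(), conv2d(), conv3d() and linear().

Quantization workflows work by adding (e.g. adding observers as .observer submodule) or replacing (e.g. converting nn.Conv2d to nn.quantized.Conv2d) submodules in the model's module hierarchy. This means that the model stays a regular nn.Module-based instance throughout the process and thus can work with the rest of the PyTorch APIs.

Quantization Custom Module API¶

Both Eager mode and FX graph mode quantization APIs provide a hook for the user to specify modules quantized in a custom way, with user-defined logic for observation and quantization. The user needs to specify:

1. The Python type of the source fp32 module (existing in the model).
2. The Python type of the observed module (provided by the user). This module needs to define a from_float function which defines how the observed module is created from the original fp32 module.
3. The Python type of the quantized module (provided by the user). This module needs to define a from_observed function which defines how the quantized module is created from the observed module.
4. A configuration describing (1), (2), (3) above, passed to the quantization APIs.

The framework will then do the following:

During the prepare module swaps, it will convert every module of the type specified in (1) to the type specified in (2), using the from_float function of the class in (2).
During the convert module swaps, it will convert every module of the type specified in (2) to the type specified in (3), using the from_observed function of the class in (3).

Currently, there is a requirement that ObservedCustomModule has a single Tensor output, and an observer will be added by the framework (not by the user) on that output. The observer will be stored under the activation_post_process key as an attribute of the custom module instance. These restrictions may be relaxed in the future.

Custom API Example: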

import torch
import torch.ao.nn.quantized as nnq
from torch.ao.quantization import QConfigMapping
import torch.ao.quantization.quantize_fx

# original fp32 module to replace
class CustomModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(3, 3)

    def forward(self, x):
        return self.linear(x)

# custom observed module, provided by user
class ObservedCustomModule(torch.nn.Module):
    def __init__(self, linear):
        super().__init__()
        self.linear = linear

    def forward(self, x):
        return self.linear(x)

    @classmethod
    def from_float(cls, float_module):
        assert hasattr(float_module, 'qconfig')
        observed = cls(float_module.linear)
        observed.qconfig = float_module.qconfig
        return observed

# custom quantized module, provided by user
class StaticQuantCustomModule(torch.nn.Module):
    def __init__(self, linear):
        super().__init__()
        self.linear = linear

    def forward(self, x):
        return self.linear(x)

    @classmethod
    def from_observed(cls, observed_module):
        assert hasattr(observed_module, 'qconfig')
        assert hasattr(observed_module, 'activation_post_process')
        observed_module.linear.activation_post_process = \
            observed_module.activation_post_process
        quantized = cls(nnq.Linear.from_float(observed_module.linear))
        return quantized

#
# example API call (Eager mode quantization)
#

m = torch.nn.Sequential(CustomModule()).eval()
prepare_custom_config_dict = {
    "float_to_observed_custom_module_class": {
        CustomModule: ObservedCustomModule
    }
}
convert_custom_config_dict = {
    "observed_to_quantized_custom_module_class": {
        ObservedCustomModule: StaticQuantCustomModule
    }
}
m.qconfig = torch.ao.quantization.default_qconfig
mp = torch.ao.quantization.prepare(
    m, prepare_custom_config_dict=prepare_custom_config_dict)
# calibration (not shown)
mq = torch.ao.quantization.convert(
    mp, convert_custom_config_dict=convert_custom_config_dict)
#
# example API call (FX graph mode quantization)
#
m = torch.nn.Sequential(CustomModule()).eval()
qconfig_mapping = QConfigMapping().set_global(torch.ao.quantization.default_qconfig)
prepare_custom_config_dict = {
    "float_to_observed_custom_module_class": {
        "static": {
            CustomModule: ObservedCustomModule,
        }
    }
}
convert_custom_config_dict = {
    "observed_to_quantized_custom_module_class": {
        "static": {
            ObservedCustomModule: StaticQuantCustomModule,
        }
    }
}
mp = torch.ao.quantization.quantize_fx.prepare_fx(
    m, qconfig_mapping, (torch.randn(3, 3),), prepare_custom_config=prepare_custom_config_dict)
# calibration (not shown)
mq = torch.ao.quantization.quantize_fx.convert_fx(
    mp, convert_custom_config=convert_custom_config_dict)

Best Practices¶

1. If you are using the x86 backend, we need to use 7 bits instead of 8 bits. Make sure you reduce the range for quant_min and quant_max: if dtype is torch.quint8, set a custom quant_min of 0 and quant_max of 127 (255 / 2); if dtype is torch.qint8, set a custom quant_min of -64 (-128 / 2) and quant_max of 63 (127 / 2). This is already set correctly if you call torch.ao.quantization.get_default_qconfig(backend) or torch.ao.quantization.get_default_qat_qconfig(backend) to get the default qconfig for the x86 or qnnpack backend.

2. If the onednn backend is chosen, 8 bits for activation will be used in the default qconfig mapping torch.ao.quantization.get_default_qconfig_mapping('onednn') and the default qconfig torch.ao.quantization.get_default_qconfig('onednn'). It is recommended to use it on CPUs with Vector Neural Network Instruction (VNNI) support. Otherwise, set reduce_range to True on the activation observer to get better accuracy on CPUs without VNNI support.
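The reduced ranges quoted in item 1 come from halving the full 8-bit limits with integer division. The helper below is a hypothetical plain-Python sketch of that arithmetic (dtype names are passed as strings here purely for illustration); it is not the observer implementation.

```python
def quant_range(dtype: str, reduce_range: bool = False):
    """Return (quant_min, quant_max) for an 8-bit dtype, optionally reduced to 7 bits."""
    if dtype == "quint8":
        qmin, qmax = 0, 255
    elif dtype == "qint8":
        qmin, qmax = -128, 127
    else:
        raise ValueError(f"unsupported dtype: {dtype}")
    if reduce_range:
        # Floor division reproduces the values in the text: 255 // 2 = 127,
        # -128 // 2 = -64, 127 // 2 = 63.
        qmin, qmax = qmin // 2, qmax // 2
    return qmin, qmax


for name in ("quint8", "qint8"):
    print(name, quant_range(name), quant_range(name, reduce_range=True))
```

Both reduced ranges contain 128 integer levels, which is what using 7 bits instead of 8 amounts to.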

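Frequently Asked Questions¶

How can I do quantized inference on GPU?

We don't have official GPU support yet, but this is an area of active development; you can find more information here.

Where can I get ONNX support for my quantized model?

If you get errors exporting the model (using APIs under torch.onnx), you may open an issue in the PyTorch repository. Prefix the issue title with [ONNX] and tag the issue as module: onnx.

If you encounter issues with ONNX Runtime, open an issue at GitHub - microsoft/onnxruntime.

How can I use quantization with LSTMs?

LSTM is supported through our custom module API in both eager mode and fx graph mode quantization. Examples can be found at: Eager mode: pytorch/test_quantized_op.py TestQuantizedOps.test_custom_module_lstm; FX Graph mode: pytorch/test_quantize_fx.py TestQuantizeFx.test_static_lstm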

Common Errors¶

Passing a non-quantized Tensor into a quantized kernel¶

If you see an error similar to:

RuntimeError: Could not run 'quantized::some_operator' with arguments from the 'CPU' backend...

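This means that you are trying to pass a non-quantized Tensor to a quantized kernel. A common workaround is to use torch.ao.quantization.QuantStub to quantize the tensor. This needs to be done manually in Eager mode quantization. An e2e example: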

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)

    def forward(self, x):
        # during the convert step, this will be replaced with a
        # `quantize_per_tensor` call
        x = self.quant(x)
        x = self.conv(x)
        return x

Passing a quantized Tensor into a non-quantized kernel¶