(prototype) PyTorch 2 Export Post Training Quantization¶

Created On: Oct 02, 2023 | Last Updated: Oct 23, 2024 | Last Verified: Nov 05, 2024

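This tutorial introduces the steps for post training static quantization in graph mode based on torch._export.export. Compared to FX Graph Mode Quantization, this flow is expected to have significantly higher model coverage (88% on 14K models), better programmability, and a simplified UX.

Being exportable by torch.export.export is a prerequisite for using this flow. You can find the supported constructs in Export DB.

The high-level architecture of quantization 2 with a quantizer could look like this: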

float_model(Python)                          Example Input
    \                                              /
     \                                            /
--------------------------------------------------------
|                        export                        |
--------------------------------------------------------
                            |
                    FX Graph in ATen     Backend Specific Quantizer
                            |                       /
---------------------------------------------------------
|                     prepare_pt2e                      |
---------------------------------------------------------
                            |
                     Calibrate/Train
                            |
---------------------------------------------------------
|                    convert_pt2e                       |
---------------------------------------------------------
                            |
                    Quantized Model
                            |
---------------------------------------------------------
|                       Lowering                        |
---------------------------------------------------------
                            |
        Executorch, Inductor or <Other Backends>

The PyTorch 2 export quantization API looks like this:

import torch
class M(torch.nn.Module):
   def __init__(self):
      super().__init__()
      self.linear = torch.nn.Linear(5, 10)

   def forward(self, x):
      return self.linear(x)


example_inputs = (torch.randn(1, 5),)
m = M().eval()

# Step 1. program capture
# This is available for pytorch 2.5+, for more details on lower pytorch versions
# please check `Export the model with torch.export` section
m = torch.export.export_for_training(m, example_inputs).module()
# we get a model with aten ops


# Step 2. quantization
from torch.ao.quantization.quantize_pt2e import (
  prepare_pt2e,
  convert_pt2e,
)

from torch.ao.quantization.quantizer.xnnpack_quantizer import (
  XNNPACKQuantizer,
  get_symmetric_quantization_config,
)
# backend developer will write their own Quantizer and expose methods to allow
# users to express how they
# want the model to be quantized
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
m = prepare_pt2e(m, quantizer)

# calibration omitted

m = convert_pt2e(m)
# we have a model with aten ops doing integer computations when possible

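Motivation of PyTorch 2 Export Quantization¶

In PyTorch versions prior to 2, we have FX Graph Mode Quantization, which uses QConfigMapping and BackendConfig for customization. QConfigMapping allows modeling users to specify how they want their model to be quantized, and BackendConfig allows backend developers to specify the ways of quantization supported by their backend. While that API covers most use cases relatively well, it is not fully extensible. The current API has two main limitations:

1. Limited ability to express quantization intentions for complicated operator patterns (how an operator pattern should be observed/quantized) using the existing objects, QConfig and QConfigMapping.
2. Limited support for how users can express their intention of how they want their model to be quantized. For example, if users want to quantize every other linear layer in the model, or the quantization behavior depends on the actual shape of the tensor (for example, only observe/quantize inputs and outputs when the linear layer has a 3D input), backend developers or modeling users need to change the core quantization API/flow.

A few improvements could make the existing flow better:

3. We use QConfigMapping and BackendConfig as separate objects. QConfigMapping describes the user's intention of how they want their model to be quantized, and BackendConfig describes what kind of quantization a backend supports. BackendConfig is backend specific, but QConfigMapping is not, so a user can provide a QConfigMapping that is incompatible with a specific BackendConfig, which is not a great UX. Ideally, we can structure this better by making both the configuration (QConfigMapping) and the quantization capability (BackendConfig) backend specific, so there is less confusion about incompatibilities.
4. In QConfig, we expose observer/fake_quant observer classes as objects for the user to configure quantization, which increases the number of things the user may need to care about: not only the dtype, but also how the observation should happen. These details could potentially be hidden from the user to make the user flow simpler.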

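Here is a summary of the benefits of the new API:

Programmability (addressing 1. and 2.): When a user's quantization needs are not covered by the available quantizers, users can build their own quantizer and compose it with other quantizers, as mentioned above.
Simplified UX (addressing 3.): Provides a single instance with which both the backend and users interact. Thus you no longer have a user-facing quantization config mapping to map user intent and a separate quantization config that backends interact with to configure what the backend supports. We will still have a way for users to query what is supported in a quantizer. With a single instance, composing different quantization capabilities also becomes more natural than before.

For example, XNNPACK does not support embedding_byte, while ExecuTorch supports it natively. Thus, if we had an ExecuTorchQuantizer that only quantized embedding_byte, it could be composed with XNNPACKQuantizer. Previously, this required concatenating the two BackendConfig objects, and since the options in QConfigMapping are not backend specific, users also needed to figure out by themselves how to specify a configuration that matches the quantization capabilities of the combined backend. With a single quantizer instance, we can compose two quantizers and query the composed quantizer for its capabilities, which is less error prone and cleaner, for example, composed_quantizer.quantization_capabilities().
Separation of concerns (addressing 4.): As we design the quantizer API, we also decouple the specification of quantization (expressed in terms of dtype, min/max (number of bits), symmetric, and so on) from the observer concept. Currently, the observer captures both the quantization specification and how to observe (Histogram vs. MinMax observer). With this change, modeling users are freed from interacting with observer and fake quant objects.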
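
As a rough illustration of the programmability and composition ideas above, the following pure-Python toy mimics them without touching the real PyTorch API. All names here (`ToyOp`, `EveryOtherLinearQuantizer`, `EmbeddingOnlyQuantizer`, `ComposedToyQuantizer`) are hypothetical and exist only for this sketch; real quantizers annotate nodes of an exported FX graph rather than a flat list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyOp:
    """Hypothetical stand-in for a graph node: an op kind plus an annotation slot."""
    name: str
    kind: str
    annotation: Optional[str] = None


class EveryOtherLinearQuantizer:
    """Toy quantizer: marks every other linear op (the 1st, 3rd, ...) as int8."""

    def annotate(self, ops: List[ToyOp]) -> None:
        linears = [op for op in ops if op.kind == "linear"]
        for op in linears[::2]:
            if op.annotation is None:
                op.annotation = "int8"


class EmbeddingOnlyQuantizer:
    """Toy quantizer: only knows how to handle embedding ops."""

    def annotate(self, ops: List[ToyOp]) -> None:
        for op in ops:
            if op.kind == "embedding" and op.annotation is None:
                op.annotation = "embedding_byte"


class ComposedToyQuantizer:
    """Runs several toy quantizers in order; earlier annotations are never overwritten."""

    def __init__(self, quantizers):
        self.quantizers = list(quantizers)

    def annotate(self, ops: List[ToyOp]) -> None:
        for quantizer in self.quantizers:
            quantizer.annotate(ops)


ops = [
    ToyOp("emb", "embedding"),
    ToyOp("fc1", "linear"),
    ToyOp("fc2", "linear"),
    ToyOp("fc3", "linear"),
]
composed = ComposedToyQuantizer([EmbeddingOnlyQuantizer(), EveryOtherLinearQuantizer()])
composed.annotate(ops)
annotations = {op.name: op.annotation for op in ops}
print(annotations)
```

The point of the sketch is that the selection rule ("every other linear") lives inside a quantizer object, so expressing it does not require changing the core flow, and two quantizers can be combined behind a single instance.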

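Define Helper Functions and Prepare the Dataset¶

We'll start by doing the necessary imports, defining some helper functions, and preparing the data. These steps are identical to Static Quantization with Eager Mode in PyTorch.

To run the code in this tutorial using the entire ImageNet dataset, first download ImageNet by following the instructions at ImageNet Data. Unzip the downloaded file into the data_path folder.

Download the torchvision resnet18 model and rename it to data/resnet18_pretrained_float.pth.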

import os
import sys
import time
import numpy as np

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

import torchvision
from torchvision import datasets
from torchvision.models.resnet import resnet18
import torchvision.transforms as transforms

# Set up warnings
import warnings
warnings.filterwarnings(
    action='ignore',
    category=DeprecationWarning,
    module=r'.*'
)
warnings.filterwarnings(
    action='default',
    module=r'torch.ao.quantization'
)

# Specify random seed for repeatable results
_ = torch.manual_seed(191009)


class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)


def accuracy(output, target, topk=(1,)):
    """
    Computes the accuracy over the k top predictions for the specified
    values of k.
    """
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)

        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        res = []
        for k in topk:
            correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res


def evaluate(model, criterion, data_loader):
    model.eval()
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    cnt = 0
    with torch.no_grad():
        for image, target in data_loader:
            output = model(image)
            loss = criterion(output, target)
            cnt += 1
            acc1, acc5 = accuracy(output, target, topk=(1, 5))
            top1.update(acc1[0], image.size(0))
            top5.update(acc5[0], image.size(0))
    print('')

    return top1, top5

def load_model(model_file):
    model = resnet18(weights=None)  # `pretrained` is deprecated in recent torchvision
    state_dict = torch.load(model_file, weights_only=True)
    model.load_state_dict(state_dict)
    model.to("cpu")
    return model

def print_size_of_model(model):
    torch.save(model.state_dict(), "temp.p")
    print("Size (MB):", os.path.getsize("temp.p")/1e6)
    os.remove("temp.p")

def prepare_data_loaders(data_path):
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    dataset = torchvision.datasets.ImageNet(
        data_path, split="train", transform=transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ]))
    dataset_test = torchvision.datasets.ImageNet(
        data_path, split="val", transform=transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ]))

    train_sampler = torch.utils.data.RandomSampler(dataset)
    test_sampler = torch.utils.data.SequentialSampler(dataset_test)

    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=train_batch_size,
        sampler=train_sampler)

    data_loader_test = torch.utils.data.DataLoader(
        dataset_test, batch_size=eval_batch_size,
        sampler=test_sampler)

    return data_loader, data_loader_test

data_path = '~/.data/imagenet'
saved_model_dir = 'data/'
float_model_file = 'resnet18_pretrained_float.pth'

train_batch_size = 30
eval_batch_size = 50

data_loader, data_loader_test = prepare_data_loaders(data_path)
example_inputs = (next(iter(data_loader))[0],)  # trailing comma makes this a 1-tuple
criterion = nn.CrossEntropyLoss()
float_model = load_model(saved_model_dir + float_model_file).to("cpu")
float_model.eval()

# create another instance of the model since
# we need to keep the original model around
model_to_quantize = load_model(saved_model_dir + float_model_file).to("cpu")
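
The `accuracy` helper above counts a prediction as correct at top-k when the target class is among the k highest-scoring classes. A minimal pure-Python sketch of the same idea follows; the `topk_accuracy` function and the scores are made up for illustration and are not part of the tutorial's code.

```python
def topk_accuracy(scores, targets, k):
    """Percentage of samples whose target is among the k highest scores."""
    correct = 0
    for row, target in zip(scores, targets):
        # Indices of the k largest scores for this sample.
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if target in topk:
            correct += 1
    return 100.0 * correct / len(targets)


# Made-up scores for 4 samples over 3 classes.
scores = [
    [0.1, 0.7, 0.2],  # best class: 1
    [0.5, 0.3, 0.2],  # best class: 0
    [0.2, 0.3, 0.5],  # best class: 2
    [0.6, 0.3, 0.1],  # best class: 0
]
targets = [1, 1, 0, 0]

top1 = topk_accuracy(scores, targets, k=1)
top2 = topk_accuracy(scores, targets, k=2)
print(top1, top2)
```

Here two of the four samples are right at top-1, and a third becomes right once the second-best class is also accepted.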

Set the model to eval mode¶

For post training quantization, we need to set the model to eval mode.

model_to_quantize.eval()

Export the model with torch.export¶

Here is how you can use torch.export to export the model:

example_inputs = (torch.rand(2, 3, 224, 224),)
# for pytorch 2.5+
exported_model = torch.export.export_for_training(model_to_quantize, example_inputs).module()

# for pytorch 2.4 and before
# from torch._export import capture_pre_autograd_graph
# exported_model = capture_pre_autograd_graph(model_to_quantize, example_inputs)

# or capture with dynamic dimensions
# for pytorch 2.5+
dynamic_shapes = tuple(
  {0: torch.export.Dim("dim")} if i == 0 else None
  for i in range(len(example_inputs))
)
exported_model = torch.export.export_for_training(model_to_quantize, example_inputs, dynamic_shapes=dynamic_shapes).module()

# for pytorch 2.4 and before
# dynamic_shape API may vary as well
# from torch._export import dynamic_dim
# exported_model = capture_pre_autograd_graph(model_to_quantize, example_inputs, constraints=[dynamic_dim(example_inputs[0], 0)])

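Import the Backend Specific Quantizer and Configure how to Quantize the Model¶

The following code snippet describes how to quantize the model: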

from torch.ao.quantization.quantizer.xnnpack_quantizer import (
  XNNPACKQuantizer,
  get_symmetric_quantization_config,
)
quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config())

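Quantizer is backend specific, and each Quantizer provides its own way for users to configure their model. Just as an example, here are the different configuration APIs supported by XNNPACKQuantizer:

# qconfig_opt is an optional quantization config
(
    quantizer.set_global(qconfig_opt)
    .set_object_type(torch.nn.Conv2d, qconfig_opt)  # can be a module type
    .set_object_type(torch.nn.functional.linear, qconfig_opt)  # or torch functional op
    .set_module_name("foo.bar", qconfig_opt)
)

Note

Check out our tutorial that describes how to write a new Quantizer.

Prepare the Model for Post Training Quantization¶

prepare_pt2e folds BatchNorm operators into preceding Conv2d operators, and inserts observers in appropriate places in the model.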

prepared_model = prepare_pt2e(exported_model, quantizer)
print(prepared_model.graph)

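Calibration¶

The calibration function is run after the observers are inserted in the model. The purpose of calibration is to run through some examples that are representative of the workload (for example, a sample of the training data set) so that the observers in the model can observe the statistics of the tensors. We can later use this information to calculate quantization parameters.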

def calibrate(model, data_loader):
    model.eval()
    with torch.no_grad():
        for image, target in data_loader:
            model(image)
calibrate(prepared_model, data_loader_test)  # run calibration on sample data
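
To make the link between observed statistics and quantization parameters concrete, here is a toy min/max observer in pure Python. It is a sketch under stated assumptions: `ToyMinMaxObserver` is a hypothetical class, the batches are made up, and it uses an asymmetric int8 range for illustration, whereas the quantizer configured in this tutorial uses a symmetric config and the real observers in torch.ao.quantization are more elaborate.

```python
class ToyMinMaxObserver:
    """Hypothetical observer: tracks the running min and max of everything it sees."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, values):
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))

    def calculate_qparams(self, quant_min=-128, quant_max=127):
        # Extend the range to include 0.0 so that zero is exactly representable.
        lo = min(self.min_val, 0.0)
        hi = max(self.max_val, 0.0)
        # Guard against a zero scale when every observed value is 0.
        scale = max((hi - lo) / (quant_max - quant_min), 1e-8)
        zero_point = quant_min - round(lo / scale)
        return scale, zero_point


observer = ToyMinMaxObserver()
# Made-up "calibration batches"; each call plays the role of one forward pass.
for batch in ([0.5, -1.0, 2.0], [3.0, 0.25], [-0.5, 1.5]):
    observer.observe(batch)

scale, zero_point = observer.calculate_qparams()
print(observer.min_val, observer.max_val, scale, zero_point)
```

With an observed range of [-1.0, 3.0], the scale spreads 4.0 units over 255 integer steps, and the zero point is chosen so that -1.0 lands on -128 and 3.0 lands on 127.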

Convert the Calibrated Model to a Quantized Model¶

convert_pt2e takes a calibrated model and produces a quantized model.

quantized_model = convert_pt2e(prepared_model)
print(quantized_model)

At this step, we currently have two representations that you can choose from, but the exact representation we offer in the long term might change based on feedback from PyTorch users.

Q/DQ Representation (default)

As in the previous documentation for representations, all quantized operators are represented as dequantize -> fp32_op -> quantize.

def quantized_linear(x_int8, x_scale, x_zero_point, weight_int8, weight_scale, weight_zero_point, bias_fp32, output_scale, output_zero_point):
    # int8 range, assumed here for the input, weight, and output
    quant_min, quant_max = -128, 127
    x_fp32 = torch.ops.quantized_decomposed.dequantize_per_tensor(
        x_int8, x_scale, x_zero_point, quant_min, quant_max, torch.int8)
    weight_fp32 = torch.ops.quantized_decomposed.dequantize_per_tensor(
        weight_int8, weight_scale, weight_zero_point, quant_min, quant_max, torch.int8)
    weight_permuted = torch.ops.aten.permute_copy.default(weight_fp32, [1, 0])
    out_fp32 = torch.ops.aten.addmm.default(bias_fp32, x_fp32, weight_permuted)
    out_int8 = torch.ops.quantized_decomposed.quantize_per_tensor(
        out_fp32, output_scale, output_zero_point, quant_min, quant_max, torch.int8)
    return out_int8
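
The arithmetic behind the dequantize -> fp32_op -> quantize pattern can be sketched in pure Python on plain numbers. The helper names (`quantize`, `dequantize`, `qdq_linear`) and all values are hypothetical and chosen so the math is easy to follow; the weight is laid out as one row per output feature.

```python
def quantize(x, scale, zero_point, quant_min=-128, quant_max=127):
    """Affine quantization of a float to an int8 value (kept as a Python int)."""
    q = round(x / scale) + zero_point
    return max(quant_min, min(quant_max, q))


def dequantize(q, scale, zero_point):
    """Map a quantized integer back to an approximate float."""
    return (q - zero_point) * scale


def qdq_linear(x_q, x_scale, x_zp, w_q, w_scale, w_zp, bias, out_scale, out_zp):
    """dequantize -> float linear -> quantize, for one input row and a weight matrix."""
    x = [dequantize(v, x_scale, x_zp) for v in x_q]
    w = [[dequantize(v, w_scale, w_zp) for v in row] for row in w_q]
    out = [sum(xi * wi for xi, wi in zip(x, row)) + b for row, b in zip(w, bias)]
    return [quantize(v, out_scale, out_zp) for v in out]


# Quantization is lossy: 1.3 is stored as 3 and read back as 1.5 with scale 0.5.
roundtrip = dequantize(quantize(1.3, 0.5, 0), 0.5, 0)

# x = [1.0, -2.0], w = [[1.0, 2.0], [-1.0, 0.0]], bias = [0.5, 1.0]
result = qdq_linear(
    x_q=[2, -4], x_scale=0.5, x_zp=0,
    w_q=[[4, 8], [-4, 0]], w_scale=0.25, w_zp=0,
    bias=[0.5, 1.0], out_scale=0.5, out_zp=3,
)
print(roundtrip, result)
```

The float outputs are -2.5 and 0.0; dividing by the output scale 0.5 and adding the output zero point 3 gives the quantized values -2 and 3.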

Reference Quantized Model Representation

We will have a special representation for selected ops, for example, quantized linear. Other ops are represented as dq -> float32_op -> q, and q/dq are decomposed into more primitive operators. You can get this representation by using convert_pt2e(..., use_reference_representation=True).

# Reference Quantized Pattern for quantized linear
def quantized_linear(x_int8, x_scale, x_zero_point, weight_int8, weight_scale, weight_zero_point, bias_fp32, output_scale, output_zero_point):
    qmin, qmax = -128, 127  # int8 range, assumed here for illustration
    x_int16 = x_int8.to(torch.int16)
    weight_int16 = weight_int8.to(torch.int16)
    acc_int32 = torch.ops.out_dtype(torch.mm, torch.int32, (x_int16 - x_zero_point), (weight_int16 - weight_zero_point))
    bias_scale = x_scale * weight_scale
    bias_int32 = torch.ops.out_dtype(torch.ops.aten.div.Tensor, torch.int32, bias_fp32, bias_scale)
    acc_int32 = acc_int32 + bias_int32
    acc_int32 = torch.ops.out_dtype(torch.ops.aten.mul.Scalar, torch.int32, acc_int32, x_scale * weight_scale / output_scale) + output_zero_point
    out_int8 = torch.ops.aten.clamp(acc_int32, qmin, qmax).to(torch.int8)
    return out_int8
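
The same integer-accumulate idea can be traced by hand in pure Python. This is a sketch, not the reference implementation: `reference_linear` is a hypothetical helper, the values are made up, and it rounds where the pattern above converts to int32.

```python
def reference_linear(x_q, x_scale, x_zp, w_q, w_scale, w_zp, bias, out_scale, out_zp,
                     quant_min=-128, quant_max=127):
    """Integer-accumulate linear on plain Python ints, one weight row per output."""
    bias_scale = x_scale * w_scale
    multiplier = bias_scale / out_scale  # single rescale factor applied at the end
    out = []
    for row, b in zip(w_q, bias):
        # Integer accumulation of zero-point-corrected products.
        acc = sum((xi - x_zp) * (wi - w_zp) for xi, wi in zip(x_q, row))
        # Bias expressed in accumulator units.
        acc += round(b / bias_scale)
        # Rescale to the output scale, add the output zero point, and clamp.
        q = round(acc * multiplier) + out_zp
        out.append(max(quant_min, min(quant_max, q)))
    return out


result = reference_linear(
    x_q=[2, -4], x_scale=0.5, x_zp=0,
    w_q=[[4, 8], [-4, 0]], w_scale=0.25, w_zp=0,
    bias=[0.5, 1.0], out_scale=0.5, out_zp=3,
)
print(result)
```

For the first output, the accumulator is 2*4 + (-4)*8 = -24, the bias adds 0.5 / 0.125 = 4, and rescaling -20 by 0.125 / 0.5 gives -5, which becomes -2 after adding the zero point. In float terms the inputs are x = [1.0, -2.0] and w = [[1.0, 2.0], [-1.0, 0.0]], so the float result -2.5 quantizes to the same value.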

See here for the most up-to-date reference representations.

Checking Model Size and Accuracy Evaluation¶

Now we can compare the size and model accuracy with the baseline model.

# Baseline model size and accuracy
print("Size of baseline model")
print_size_of_model(float_model)

top1, top5 = evaluate(float_model, criterion, data_loader_test)
print("Baseline Float Model Evaluation accuracy: %2.2f, %2.2f"%(top1.avg, top5.avg))

# Quantized model size and accuracy
print("Size of model after quantization")
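# export again to remove unused weights; dynamic_shapes (defined earlier) keeps
# the batch dimension from being specialized to the example input
quantized_model = torch.export.export_for_training(quantized_model, example_inputs, dynamic_shapes=dynamic_shapes).module()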
print_size_of_model(quantized_model)

top1, top5 = evaluate(quantized_model, criterion, data_loader_test)
print("[before serialization] Evaluation accuracy on test dataset: %2.2f, %2.2f"%(top1.avg, top5.avg))

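Note

We can't do performance evaluation yet because the model has not been lowered to the target device; it is just a representation of the quantized computation in ATen operators.

Note

The weights are still in fp32 right now. We may do constant propagation for the quantize op to get integer weights in the future.

If you want to get better accuracy or performance, try configuring the quantizer in different ways. Each quantizer has its own way of configuration, so please consult the documentation for the quantizer you are using to learn how to gain more control over how a model is quantized.

Save and Load Quantized Model¶

We'll show how to save and load the quantized model.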

# 0. Store the reference output for the example inputs and check evaluation accuracy:
example_inputs = (next(iter(data_loader))[0],)
ref = quantized_model(*example_inputs)
top1, top5 = evaluate(quantized_model, criterion, data_loader_test)
print("[before serialization] Evaluation accuracy on test dataset: %2.2f, %2.2f"%(top1.avg, top5.avg))

# 1. Export the model and Save ExportedProgram
pt2e_quantized_model_file_path = saved_model_dir + "resnet18_pt2e_quantized.pth"
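# capture the model to get an ExportedProgram; dynamic_shapes (defined earlier)
# keeps the batch dimension from being specialized to the example input
quantized_ep = torch.export.export(quantized_model, example_inputs, dynamic_shapes=dynamic_shapes)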
# use torch.export.save to save an ExportedProgram
torch.export.save(quantized_ep, pt2e_quantized_model_file_path)


# 2. Load the saved ExportedProgram
loaded_quantized_ep = torch.export.load(pt2e_quantized_model_file_path)
loaded_quantized_model = loaded_quantized_ep.module()

# 3. Check results for example inputs and check evaluation accuracy again:
res = loaded_quantized_model(*example_inputs)
print("diff:", ref - res)

top1, top5 = evaluate(loaded_quantized_model, criterion, data_loader_test)
print("[after serialization/deserialization] Evaluation accuracy on test dataset: %2.2f, %2.2f"%(top1.avg, top5.avg))

Output:

[before serialization] Evaluation accuracy on test dataset: 79.82, 94.55
diff: tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

[after serialization/deserialization] Evaluation accuracy on test dataset: 79.82, 94.55

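Debugging the Quantized Model¶

You can use Numeric Suite, which helps with debugging in eager mode and FX graph mode. The new version of Numeric Suite that works with PyTorch 2 Export models is still in development.

Lowering and Performance Evaluation¶

The model produced at this point is not the final model that runs on the device. It is a reference quantized model that captures the quantized computation intended by the user, expressed as ATen operators and some additional quantize/dequantize operators. To get a model that runs on real devices, we need to lower the model. For example, for models that run on edge devices, we can lower with delegation and ExecuTorch runtime operators.

Conclusion¶

In this tutorial, we went through the overall quantization flow in PyTorch 2 Export Quantization using XNNPACKQuantizer and got a quantized model that can be further lowered to a backend that supports inference with the XNNPACK backend. To use this flow for your own backend, first follow the tutorial and implement a Quantizer for your backend, and then quantize the model with that Quantizer.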