Weight Streaming

Weight streaming in TensorRT is a powerful feature designed to overcome GPU memory limitations when working with large models. It enables running models larger than available GPU memory by streaming weight data from host (CPU) memory to GPU memory during inference.

Streaming a larger amount of memory will likely degrade performance. However, if streaming weights allows the user to run a larger batch size, it can lead to higher throughput, and this gain can sometimes outweigh the slowdown caused by streaming. For example, if streaming makes each forward pass 50% slower but frees enough memory to double the batch size, net throughput still improves by roughly 33%. The optimal amount of memory to stream depends on the specific model and hardware. Experimenting with different memory limits helps find the best balance between streaming overhead and batch size benefits.

This example uses a pre-trained Llama-2 model and demonstrates how to use the weight streaming feature with Torch-TensorRT.

  1. Compilation option - Build a trt engine with the weight streaming feature

  2. Runtime API - Control the weight streaming budget via a context manager

Imports and Model Definition

import copy
import timeit

import numpy as np
import torch
import torch_tensorrt
from transformers import AutoModelForCausalLM
from utils import export_llm


def time_generate(model, inputs, output_seq_length, iterations=10):
    """
    Measure the time for generating a sentence over certain number of iterations
    """
    # We only support single input (B x seq_len) for LLMs now
    input_seq = inputs[0]
    with torch.no_grad():
        timings = []
        for _ in range(iterations):
            start_time = timeit.default_timer()
            inputs_copy = copy.copy(input_seq)
            # Greedy decoding of the model. This generates up to output_seq_length tokens.
            while inputs_copy.shape[1] <= output_seq_length:
                outputs = model(inputs_copy)
                logits = outputs.logits
                next_token_logits = logits[:, -1, :]
                next_tokens = torch.argmax(next_token_logits, dim=-1)
                inputs_copy = torch.cat([inputs_copy, next_tokens[:, None]], dim=-1)
            torch.cuda.synchronize()
            end_time = timeit.default_timer()
            timings.append(end_time - start_time)

    times = np.array(timings)
    time_mean_ms = np.mean(times) * 1000

    return time_mean_ms


# Load the LLaMA-2 model
DEVICE = torch.device("cuda:0")
llama_path = "meta-llama/Llama-2-7b-chat-hf"
with torch.no_grad():
    model = AutoModelForCausalLM.from_pretrained(
        llama_path, use_cache=False, attn_implementation="eager"
    ).eval()

# Set input and output sequence lengths
isl = 128
osl = 256

# Create random input tensors
input_tensors = [torch.randint(0, 5, (1, isl), dtype=torch.int64).cuda()]
# Convert the model to half precision (FP16)
model = model.half()
# Exports the LLM model into an ExportedProgram with dynamic shapes.
llama2_ep = export_llm(model, input_tensors[0], max_seq_len=osl)
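
export_llm is imported from the tutorial's local utils module, which is not shown here. For readers without that file, the following is a minimal sketch of such a helper, assuming it wraps torch.export and marks the sequence dimension as dynamic up to max_seq_len; the actual helper may differ in its details.

# Hypothetical stand-in for utils.export_llm, built on torch.export.
from torch.export import Dim, export


def export_llm_sketch(model, inputs, min_seq_len=2, max_seq_len=256):
    # Mark input dimension 1 (the sequence length) as dynamic so the exported
    # graph accepts the growing sequences produced by greedy decoding.
    seq_len = Dim("seq_len", min=min_seq_len, max=max_seq_len)
    with torch.no_grad():
        return export(model, (inputs,), dynamic_shapes=({1: seq_len},))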

Compiler Options

The enable_weight_streaming=True and use_explicit_typing=True options are required to build an engine with the weight streaming feature. The use_explicit_typing=True option creates a strongly typed network, and only float32 precision is allowed in the enabled_precisions option.

# Create a TensorRT-compiled model
trt_model = torch_tensorrt.dynamo.compile(
    llama2_ep,
    inputs=input_tensors,
    enabled_precisions={torch.float32},
    truncate_double=True,
    device=DEVICE,
    use_explicit_typing=True,
    enable_weight_streaming=True,
)

# Warm up for 3 iterations
_ = time_generate(trt_model, input_tensors, osl, 3)

Running with Automatic Budget Size

Once you specify the enable_weight_streaming compile option, an automatic budget size is configured. This automatic size may not always provide the optimal solution, because the automatically determined budget lacks insight into the user's specific memory constraints and usage patterns.

# Weight streaming context to get current weight budget information
weight_streaming_ctx = torch_tensorrt.runtime.weight_streaming(trt_model)
# Measure the mean latency of the model with weight streaming
mean_latency = time_generate(trt_model, input_tensors, osl, 1)
# Calculate the percentage of current weight budget used
weight_budget_pct = (
    weight_streaming_ctx.device_budget / weight_streaming_ctx.total_device_budget * 100
)
print(
    f"Set weight streaming budget as {weight_budget_pct}%. {weight_streaming_ctx.device_budget} bytes out of {weight_streaming_ctx.total_device_budget}. mean latency = {mean_latency} ms"
)

Running with a Weight Streaming Context Manager

The weight streaming budget can be limited using the weight streaming context manager. The permissible range for the budget size is from 0 to ctx.total_device_budget. 0 means maximum memory savings by keeping the minimum amount of weights on the device, while a value equal to ctx.total_device_budget disables weight streaming. If multiple trt engines are created, the budget is distributed proportionally. The two scenarios below exercise the automatic budget and a manual 10% budget; a sketch that sweeps the full range follows them.

# Use a context manager for weight streaming
with torch_tensorrt.runtime.weight_streaming(trt_model) as weight_streaming_ctx:
    # Get the total size of streamable weights in the engine
    streamable_budget = weight_streaming_ctx.total_device_budget

    # Scenario 1: Automatic weight streaming budget
    # Get the automatically determined weight streaming budget
    requested_budget = weight_streaming_ctx.get_automatic_weight_streaming_budget()
    # Set the device budget to the automatically determined value
    weight_streaming_ctx.device_budget = requested_budget
    # Measure the mean latency with automatic budget
    mean_latency = time_generate(trt_model, input_tensors, osl, 1)
    # Calculate the percentage of the weight budget used
    weight_budget_pct = (
        weight_streaming_ctx.device_budget
        / weight_streaming_ctx.total_device_budget
        * 100
    )
    print(
        f"Set auto weight streaming budget as {weight_budget_pct}%. {weight_streaming_ctx.device_budget} bytes out of {weight_streaming_ctx.total_device_budget}. mean latency = {mean_latency} ms"
    )

    # Scenario 2: Manual 10% weight streaming budget
    # Set the budget to 10% of the total streamable weights
    requested_budget = int(streamable_budget * 0.1)
    weight_streaming_ctx.device_budget = requested_budget
    # Measure the mean latency with 10% budget
    mean_latency = time_generate(trt_model, input_tensors, osl, 1)
    # Calculate the percentage of the weight budget used
    weight_budget_pct = (
        weight_streaming_ctx.device_budget
        / weight_streaming_ctx.total_device_budget
        * 100
    )
    print(
        f"Set weight streaming budget as {weight_budget_pct}%. {weight_streaming_ctx.device_budget} bytes out of {weight_streaming_ctx.total_device_budget}. mean latency = {mean_latency} ms"
    )
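
Building on the earlier advice to experiment with different memory limits, here is a minimal sketch (not part of the original example) that sweeps a few budget fractions across the allowed range; the fraction list is an arbitrary choice, and a fraction of 1.0 disables streaming entirely.

# Hypothetical sweep: try several device budgets to find the best
# balance between streaming overhead and memory savings.
with torch_tensorrt.runtime.weight_streaming(trt_model) as weight_streaming_ctx:
    total = weight_streaming_ctx.total_device_budget
    for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
        # 0.0 streams everything (maximum savings); 1.0 keeps all
        # weights resident on the GPU, i.e. streaming is disabled.
        weight_streaming_ctx.device_budget = int(total * fraction)
        mean_latency = time_generate(trt_model, input_tensors, osl, 1)
        print(
            f"budget = {fraction:.0%} ({weight_streaming_ctx.device_budget} / {total} bytes), "
            f"mean latency = {mean_latency} ms"
        )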

Total running time of the script: (0 minutes 0.000 seconds)

Gallery generated by Sphinx-Gallery
