PyTorch Benchmark¶
Created On: Dec 02, 2020 | Last Updated: May 09, 2023 | Last Verified: Nov 05, 2024
This recipe provides a quick-start guide to using the PyTorch benchmark module to measure and compare code performance.
Introduction¶
Benchmarking is an important step in writing code. It helps us validate that our code meets performance expectations, compare different approaches to solving the same problem, and prevent performance regressions.
There are many options when it comes to benchmarking PyTorch code, including the Python builtin timeit module. However, benchmarking PyTorch code has many easily overlooked caveats, such as managing the number of threads and synchronizing CUDA devices. Moreover, generating tensor inputs for benchmarking can be quite tedious.
This recipe demonstrates how to use the PyTorch benchmark module to avoid common mistakes while making it easier to compare the performance of different code, generate inputs for benchmarking, and more.
Steps¶
1. Defining functions to benchmark
2. Benchmarking with timeit.Timer
3. Benchmarking with torch.utils.benchmark.Timer
4. Benchmarking with Blocked Autorange
5. Comparing benchmark results
6. Saving/Loading benchmark results
7. Generating inputs with Fuzzed Parameters
8. Collecting instruction counts with Callgrind
1. Defining functions to benchmark¶
As of the time of this writing, torch.dot does not support batched mode, so we will compare two approaches to implementing it using existing torch operators: one approach uses a combination of mul and sum, while the other reduces the problem to bmm.
import torch
def batched_dot_mul_sum(a, b):
    '''Computes batched dot by multiplying and summing'''
    return a.mul(b).sum(-1)

def batched_dot_bmm(a, b):
    '''Computes batched dot by reducing to ``bmm``'''
    a = a.reshape(-1, 1, a.shape[-1])
    b = b.reshape(-1, b.shape[-1], 1)
    return torch.bmm(a, b).flatten(-3)
# Input for benchmarking
x = torch.randn(10000, 64)
# Ensure that both functions compute the same output
assert batched_dot_mul_sum(x, x).allclose(batched_dot_bmm(x, x))
2. Benchmarking with timeit.Timer¶
First, let's benchmark the code using Python's builtin timeit module. We keep the benchmark code simple here so we can compare the defaults of timeit and torch.utils.benchmark.
import timeit
t0 = timeit.Timer(
    stmt='batched_dot_mul_sum(x, x)',
    setup='from __main__ import batched_dot_mul_sum',
    globals={'x': x})

t1 = timeit.Timer(
    stmt='batched_dot_bmm(x, x)',
    setup='from __main__ import batched_dot_bmm',
    globals={'x': x})
print(f'mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us')
print(f'bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us')
mul_sum(x, x): 111.6 us
bmm(x, x): 70.0 us
3. Benchmarking with torch.utils.benchmark.Timer¶
PyTorch's benchmark module was designed to be familiar to those who have used the timeit module before. However, its defaults make it easier and safer to use for benchmarking PyTorch code. Let's first compare the same basic API as above.
import torch.utils.benchmark as benchmark
t0 = benchmark.Timer(
    stmt='batched_dot_mul_sum(x, x)',
    setup='from __main__ import batched_dot_mul_sum',
    globals={'x': x})

t1 = benchmark.Timer(
    stmt='batched_dot_bmm(x, x)',
    setup='from __main__ import batched_dot_bmm',
    globals={'x': x})
print(t0.timeit(100))
print(t1.timeit(100))
<torch.utils.benchmark.utils.common.Measurement object at 0x7fb10400d0f0>
batched_dot_mul_sum(x, x)
setup: from __main__ import batched_dot_mul_sum
379.29 us
1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7fb103d67048>
batched_dot_bmm(x, x)
setup: from __main__ import batched_dot_bmm
716.42 us
1 measurement, 100 runs , 1 thread
Even though the APIs are the same for the basic functionality, there are some important differences. benchmark.Timer.timeit() returns the time per run as opposed to the total runtime like timeit.Timer.timeit() does. The PyTorch benchmark module also provides formatted string representations for printing the results.
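This difference can be illustrated with a minimal pure-Python sketch (no PyTorch involved): with the builtin timeit, converting the total to a per-run time is up to you.

```python
import timeit

# timeit.Timer.timeit returns the TOTAL time for `number` runs,
# so per-run time must be computed manually.
t = timeit.Timer(stmt='sum(range(100))')
total = t.timeit(number=1000)   # seconds for all 1000 runs combined
per_run = total / 1000          # seconds per run
print(f'{per_run * 1e6:.2f} us per run')
```

benchmark.Timer.timeit() does this division for you and wraps the result in a Measurement object.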
Another important difference, and the reason why the results diverge, is that the PyTorch benchmark module runs in a single thread by default. We can change the number of threads with the num_threads argument.
torch.utils.benchmark.Timer takes several additional arguments including: label, sub_label, description, and env, which change the __repr__ of the measurement object returned and are used for grouping the results (more on this later).
num_threads = torch.get_num_threads()
print(f'Benchmarking on {num_threads} threads')
t0 = benchmark.Timer(
    stmt='batched_dot_mul_sum(x, x)',
    setup='from __main__ import batched_dot_mul_sum',
    globals={'x': x},
    num_threads=num_threads,
    label='Multithreaded batch dot',
    sub_label='Implemented using mul and sum')

t1 = benchmark.Timer(
    stmt='batched_dot_bmm(x, x)',
    setup='from __main__ import batched_dot_bmm',
    globals={'x': x},
    num_threads=num_threads,
    label='Multithreaded batch dot',
    sub_label='Implemented using bmm')
print(t0.timeit(100))
print(t1.timeit(100))
Benchmarking on 40 threads
<torch.utils.benchmark.utils.common.Measurement object at 0x7fb103d54080>
Multithreaded batch dot: Implemented using mul and sum
setup: from __main__ import batched_dot_mul_sum
118.47 us
1 measurement, 100 runs , 40 threads
<torch.utils.benchmark.utils.common.Measurement object at 0x7fb16935d2e8>
Multithreaded batch dot: Implemented using bmm
setup: from __main__ import batched_dot_bmm
68.21 us
1 measurement, 100 runs , 40 threads
Running benchmark with all available threads gives similar results as the timeit module. More importantly, which version is faster depends on how many threads we run the code with. That is why it is important to benchmark your code with thread settings that are representative of the real use case. Another important thing to remember is to synchronize CPU and CUDA when benchmarking on the GPU. Let's run the above benchmarks again on a CUDA tensor and see what happens.
x = torch.randn(10000, 1024, device='cuda')
t0 = timeit.Timer(
    stmt='batched_dot_mul_sum(x, x)',
    setup='from __main__ import batched_dot_mul_sum',
    globals={'x': x})

t1 = timeit.Timer(
    stmt='batched_dot_bmm(x, x)',
    setup='from __main__ import batched_dot_bmm',
    globals={'x': x})
# Ran each twice to show difference before/after warm-up
print(f'mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us')
print(f'mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us')
print(f'bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us')
print(f'bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us')
mul_sum(x, x): 27.6 us
mul_sum(x, x): 25.3 us
bmm(x, x): 2775.5 us
bmm(x, x): 22.4 us
t0 = benchmark.Timer(
    stmt='batched_dot_mul_sum(x, x)',
    setup='from __main__ import batched_dot_mul_sum',
    globals={'x': x})

t1 = benchmark.Timer(
    stmt='batched_dot_bmm(x, x)',
    setup='from __main__ import batched_dot_bmm',
    globals={'x': x})
# Run only once since benchmark module does warm-up for us
print(t0.timeit(100))
print(t1.timeit(100))
<torch.utils.benchmark.utils.common.Measurement object at 0x7fb10400d080>
batched_dot_mul_sum(x, x)
setup: from __main__ import batched_dot_mul_sum
232.93 us
1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7fb10400d0f0>
batched_dot_bmm(x, x)
setup: from __main__ import batched_dot_bmm
181.04 us
1 measurement, 100 runs , 1 thread
The results reveal something interesting. The first run of the bmm version using the timeit module takes much longer than the second run. This is because bmm calls into cuBLAS, which needs to be loaded on the first call, and that takes some time. This is why it is important to do a warm-up run before benchmarking; luckily for us, PyTorch's benchmark module takes care of that.
The difference in the results between the timeit and benchmark modules is because the timeit module is not synchronizing CUDA and is therefore only timing the time to launch the kernel. PyTorch's benchmark module does the synchronization for us.
4. Benchmarking with Blocked Autorange¶
While timeit.Timer.autorange takes a single continuous measurement of at least 0.2 seconds, torch.utils.benchmark.blocked_autorange takes many measurements whose times total at least 0.2 seconds (which can be changed by the min_run_time parameter), subject to the constraint that timing overhead is a small fraction of the overall measurement. This is accomplished by first running with an increasing number of runs per loop until the runtime is much larger than the measurement overhead (which also serves as a warm-up), and then taking measurements until the target time is reached. This has the useful property that it wastes less data and allows us to compute statistics to estimate the reliability of the measurements.
m0 = t0.blocked_autorange()
m1 = t1.blocked_autorange()
print(m0)
print(m1)
<torch.utils.benchmark.utils.common.Measurement object at 0x7fb10400d0f0>
batched_dot_mul_sum(x, x)
setup: from __main__ import batched_dot_mul_sum
231.79 us
1 measurement, 1000 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7fb10400d080>
batched_dot_bmm(x, x)
setup: from __main__ import batched_dot_bmm
Median: 162.08 us
2 measurements, 1000 runs per measurement, 1 thread
We can also inspect the individual statistics from the returned measurements object.
print(f"Mean: {m0.mean * 1e6:6.2f} us")
print(f"Median: {m0.median * 1e6:6.2f} us")
Mean: 231.79 us
Median: 231.79 us
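For reference, these statistics are derived from the measurement's list of per-run times; the standard-library statistics module reproduces the same kind of numbers from any list of samples (the times below are made up for illustration).

```python
import statistics

# Toy per-run times in seconds, standing in for a measurement's
# list of per-run samples.
times = [229e-6, 230e-6, 231e-6, 232e-6, 233e-6]
mean = statistics.mean(times)
median = statistics.median(times)
print(f"Mean: {mean * 1e6:6.2f} us")
print(f"Median: {median * 1e6:6.2f} us")
```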
5. Comparing benchmark results¶
So far we've been comparing our two versions of batched dot against a single input. In practice, we want to try a combination of inputs as well as different numbers of threads. The Compare class helps display the results of many measurements in a formatted table. It uses the annotations described above (label, sub_label, num_threads, etc.) as well as description to group and organize the table. Let's use Compare to see how our functions perform for different input sizes and numbers of threads.
from itertools import product
# Compare takes a list of measurements which we'll save in results.
results = []
sizes = [1, 64, 1024, 10000]
for b, n in product(sizes, sizes):
    # label and sub_label are the rows
    # description is the column
    label = 'Batched dot'
    sub_label = f'[{b}, {n}]'
    x = torch.ones((b, n))
    for num_threads in [1, 4, 16, 32]:
        results.append(benchmark.Timer(
            stmt='batched_dot_mul_sum(x, x)',
            setup='from __main__ import batched_dot_mul_sum',
            globals={'x': x},
            num_threads=num_threads,
            label=label,
            sub_label=sub_label,
            description='mul/sum',
        ).blocked_autorange(min_run_time=1))
        results.append(benchmark.Timer(
            stmt='batched_dot_bmm(x, x)',
            setup='from __main__ import batched_dot_bmm',
            globals={'x': x},
            num_threads=num_threads,
            label=label,
            sub_label=sub_label,
            description='bmm',
        ).blocked_autorange(min_run_time=1))
compare = benchmark.Compare(results)
compare.print()
[--------------- Batched dot ----------------]
| mul/sum | bmm
1 threads: -----------------------------------
[1, 1] | 5.9 | 11.2
[1, 64] | 6.4 | 11.4
[1, 1024] | 6.7 | 14.2
[1, 10000] | 10.2 | 23.7
[64, 1] | 6.3 | 11.5
[64, 64] | 8.6 | 15.4
[64, 1024] | 39.4 | 204.4
[64, 10000] | 274.9 | 748.5
[1024, 1] | 7.7 | 17.8
[1024, 64] | 40.3 | 76.4
[1024, 1024] | 432.4 | 2795.9
[1024, 10000] | 22657.3 | 11899.5
[10000, 1] | 16.9 | 74.8
[10000, 64] | 300.3 | 609.4
[10000, 1024] | 23098.6 | 27246.1
[10000, 10000] | 267073.7 | 118823.7
4 threads: -----------------------------------
[1, 1] | 6.0 | 11.5
[1, 64] | 6.2 | 11.2
[1, 1024] | 6.8 | 14.3
[1, 10000] | 10.2 | 23.7
[64, 1] | 6.3 | 16.2
[64, 64] | 8.8 | 18.2
[64, 1024] | 41.5 | 189.1
[64, 10000] | 91.7 | 849.1
[1024, 1] | 7.6 | 17.4
[1024, 64] | 43.5 | 33.5
[1024, 1024] | 135.4 | 2782.3
[1024, 10000] | 7471.1 | 11874.0
[10000, 1] | 16.8 | 33.9
[10000, 64] | 118.7 | 173.2
[10000, 1024] | 7264.6 | 27824.7
[10000, 10000] | 100060.9 | 121499.0
16 threads: ----------------------------------
[1, 1] | 6.0 | 11.3
[1, 64] | 6.2 | 11.2
[1, 1024] | 6.9 | 14.2
[1, 10000] | 10.3 | 23.8
[64, 1] | 6.4 | 24.1
[64, 64] | 9.0 | 23.8
[64, 1024] | 54.1 | 188.5
[64, 10000] | 49.9 | 748.0
[1024, 1] | 7.6 | 23.4
[1024, 64] | 55.5 | 28.2
[1024, 1024] | 66.9 | 2773.9
[1024, 10000] | 6111.5 | 12833.7
[10000, 1] | 16.9 | 27.5
[10000, 64] | 59.5 | 73.7
[10000, 1024] | 6295.9 | 27062.0
[10000, 10000] | 71804.5 | 120365.8
32 threads: ----------------------------------
[1, 1] | 5.9 | 11.3
[1, 64] | 6.2 | 11.3
[1, 1024] | 6.7 | 14.2
[1, 10000] | 10.5 | 23.8
[64, 1] | 6.3 | 31.7
[64, 64] | 9.1 | 30.4
[64, 1024] | 72.0 | 190.4
[64, 10000] | 103.1 | 746.9
[1024, 1] | 7.6 | 28.4
[1024, 64] | 70.5 | 31.9
[1024, 1024] | 65.6 | 2804.6
[1024, 10000] | 6764.0 | 11871.4
[10000, 1] | 17.8 | 31.8
[10000, 64] | 110.3 | 56.0
[10000, 1024] | 6640.2 | 27592.2
[10000, 10000] | 73003.4 | 120083.2
Times are in microseconds (us).
The results above indicate that the version which reduces to bmm is better for larger tensors running on multiple threads, while for smaller and/or single-threaded code the other version performs better.
Compare also provides functions for changing the table format:
compare.trim_significant_figures()
compare.colorize()
compare.print()
6. Saving/Loading benchmark results¶
Measurements (and CallgrindStats, which are described in section 8) can be serialized with the pickle module. This makes A/B testing easy, as you can collect measurements from two separate environments, pickle them, and then load both in a single environment. Timer even takes an env constructor argument so that such A/B testing works seamlessly.
Let's imagine that rather than two Python functions, the mul/sum and bmm approaches were in two different builds of PyTorch. The example below demonstrates one way to A/B test them. For simplicity, we only use a subset of shapes, and simply round trip the results through pickle rather than actually using multiple environments and writing results to disk.
import pickle
ab_test_results = []
for env in ('environment A: mul/sum', 'environment B: bmm'):
    for b, n in ((1, 1), (1024, 10000), (10000, 1)):
        x = torch.ones((b, n))
        dot_fn = (batched_dot_mul_sum if env == 'environment A: mul/sum' else batched_dot_bmm)
        m = benchmark.Timer(
            stmt='batched_dot(x, x)',
            globals={'x': x, 'batched_dot': dot_fn},
            num_threads=1,
            label='Batched dot',
            description=f'[{b}, {n}]',
            env=env,
        ).blocked_autorange(min_run_time=1)
        ab_test_results.append(pickle.dumps(m))
ab_results = [pickle.loads(i) for i in ab_test_results]
compare = benchmark.Compare(ab_results)
compare.trim_significant_figures()
compare.colorize()
compare.print()
[------------------------------------- Batched dot -------------------------------------]
| [1, 1] | [1024, 10000] | [10000, 1]
1 threads: ------------------------------------------------------------------------------
(environment A: mul/sum) batched_dot(x, x) | 7 | 36000 | 21
(environment B: bmm) batched_dot(x, x) | 14 | 40000 | 85
Times are in microseconds (us).
# And just to show that we can round trip all of the results from earlier:
round_tripped_results = pickle.loads(pickle.dumps(results))
assert(str(benchmark.Compare(results)) == str(benchmark.Compare(round_tripped_results)))
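In a real A/B workflow, each environment would write its pickled measurements to disk for a later process to load and compare. A minimal sketch of that round trip, using plain dicts as stand-ins for measurement objects and a hypothetical file name:

```python
import os
import pickle
import tempfile

# Stand-ins for pickled measurement results from two environments.
results_env_a = [{'env': 'A', 'median_us': 231.8}]
results_env_b = [{'env': 'B', 'median_us': 181.0}]

# Each environment would write its own file; here we combine for brevity.
path = os.path.join(tempfile.mkdtemp(), 'ab_results.pkl')
with open(path, 'wb') as f:
    pickle.dump(results_env_a + results_env_b, f)

# A single analysis environment loads everything back for comparison.
with open(path, 'rb') as f:
    loaded = pickle.load(f)
assert loaded == results_env_a + results_env_b
```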
7. Generating inputs with Fuzzed Parameters¶
As we've seen in the previous section, there can be some stark performance differences depending on the input tensors. Hence, it is a good idea to run benchmarks on a number of different inputs. However, creating all these input tensors can be tedious, which is where torch.utils.benchmark.Fuzzer and related classes come in. Let's take a look at how we can use the Fuzzer to create some test cases for the benchmark.
from torch.utils.benchmark import Fuzzer, FuzzedParameter, FuzzedTensor, ParameterAlias
# Generates random tensors with 128 to 10000000 elements and sizes k0 and k1 chosen from a
# ``loguniform`` distribution in [1, 10000], 40% of which will be discontiguous on average.
example_fuzzer = Fuzzer(
    parameters=[
        FuzzedParameter('k0', minval=1, maxval=10000, distribution='loguniform'),
        FuzzedParameter('k1', minval=1, maxval=10000, distribution='loguniform'),
    ],
    tensors=[
        FuzzedTensor('x', size=('k0', 'k1'), min_elements=128, max_elements=10000000, probability_contiguous=0.6)
    ],
    seed=0,
)
results = []
for tensors, tensor_params, params in example_fuzzer.take(10):
    # description is the column label
    sub_label = f"{params['k0']:<6} x {params['k1']:<4} {'' if tensor_params['x']['is_contiguous'] else '(discontiguous)'}"
    results.append(benchmark.Timer(
        stmt='batched_dot_mul_sum(x, x)',
        setup='from __main__ import batched_dot_mul_sum',
        globals=tensors,
        label='Batched dot',
        sub_label=sub_label,
        description='mul/sum',
    ).blocked_autorange(min_run_time=1))
    results.append(benchmark.Timer(
        stmt='batched_dot_bmm(x, x)',
        setup='from __main__ import batched_dot_bmm',
        globals=tensors,
        label='Batched dot',
        sub_label=sub_label,
        description='bmm',
    ).blocked_autorange(min_run_time=1))
compare = benchmark.Compare(results)
compare.trim_significant_figures()
compare.print()
[--------------------- Batched dot ---------------------]
| mul/sum | bmm
1 threads: ----------------------------------------------
725 x 257 | 87 | 180
49 x 383 | 15 | 30
34 x 1468 | 30 | 118
187 x 5039 | 400 | 1200
2140 x 1296 (discontiguous) | 2000 | 41000
78 x 1598 | 74 | 310
519 x 763 | 190 | 1500
141 x 1082 | 87 | 500
78 x 5 (discontiguous) | 9 | 20
187 x 1 | 12 | 10
Times are in microseconds (us).
There is a lot of flexibility for defining your own fuzzers, which is great for creating a powerful set of inputs to benchmark. But to make things even simpler, the PyTorch benchmark module comes with some built-in fuzzers for common benchmarking needs. Let's take a look at how we can use one of these built-in fuzzers.
from torch.utils.benchmark.op_fuzzers import binary
results = []
for tensors, tensor_params, params in binary.BinaryOpFuzzer(seed=0).take(10):
    sub_label = f"{params['k0']:<6} x {params['k1']:<4} {'' if tensor_params['x']['is_contiguous'] else '(discontiguous)'}"
    results.append(benchmark.Timer(
        stmt='batched_dot_mul_sum(x, x)',
        setup='from __main__ import batched_dot_mul_sum',
        globals=tensors,
        label='Batched dot',
        sub_label=sub_label,
        description='mul/sum',
    ).blocked_autorange(min_run_time=1))
    results.append(benchmark.Timer(
        stmt='batched_dot_bmm(x, x)',
        setup='from __main__ import batched_dot_bmm',
        globals=tensors,
        label='Batched dot',
        sub_label=sub_label,
        description='bmm',
    ).blocked_autorange(min_run_time=1))
compare = benchmark.Compare(results)
compare.trim_significant_figures()
compare.colorize(rowwise=True)
compare.print()
[----------------------- Batched dot ------------------------]
| mul/sum | bmm
1 threads: ---------------------------------------------------
64 x 473 (discontiguous) | 10000 | 40000
16384 x 12642115 (discontiguous) | 31 | 78
8192 x 892 | 4800 | 20400
512 x 64 (discontiguous) | 110000 | 400000
493 x 27 (discontiguous) | 1100 | 2440
118 x 32 (discontiguous) | 870 | 2030
16 x 495 (discontiguous) | 23600 | 24000
488 x 62374 | 90000 | 100000
240372 x 69 | 40000 | 16000
40156 x 32 (discontiguous) | 2670 | 5000
Times are in microseconds (us).
8. Collecting instruction counts with Callgrind¶
One of the challenges of optimizing code is the variation and opacity of wall time. There are many sources of non-determinism, from adaptive clock speeds to resource contention with other processes. Furthermore, end-to-end time gives no insight into where time is being spent, which is really what we're interested in when optimizing code.
A complementary approach is to also collect instruction counts. These counts are a proxy metric and do not capture all aspects of performance (e.g. memory or I/O bound tasks), however they do have several useful properties. Instruction counts are reproducible, insensitive to environmental variation, and offer fine-grained insight into where a program is spending cycles.
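The reproducibility claim can be illustrated with a toy stand-in. This is not how Callgrind works (it counts CPU instructions in an instrumented subprocess); the sketch below merely counts Python opcodes via sys.settrace to show that counting work, unlike timing it, is deterministic.

```python
import sys

def count_opcodes(fn):
    # Count Python bytecode ops executed by fn. Unlike wall time,
    # this count is identical across repeated runs of the same work.
    count = 0
    def tracer(frame, event, arg):
        nonlocal count
        frame.f_trace_opcodes = True  # request per-opcode events (3.7+)
        if event == 'opcode':
            count += 1
        return tracer
    sys.settrace(tracer)
    try:
        fn()
    finally:
        sys.settrace(None)
    return count

c1 = count_opcodes(lambda: sum(range(100)))
c2 = count_opcodes(lambda: sum(range(100)))
assert c1 == c2  # reproducible, unlike timings
```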
To see the utility of instruction counts, let us look at how we might reduce the overhead of batched_dot_mul_sum. The obvious solution is to move it to C++, so we avoid going between Python and C++ multiple times.
Fortunately, the source is nearly identical. One question that we have to ask in C++ is whether we should take arguments by value or reference.
batched_dot_src = """\
/* ---- Python ---- */
// def batched_dot_mul_sum(a, b):
//     return a.mul(b).sum(-1)

torch::Tensor batched_dot_mul_sum_v0(
    const torch::Tensor a,
    const torch::Tensor b) {
  return a.mul(b).sum(-1);
}

torch::Tensor batched_dot_mul_sum_v1(
    const torch::Tensor& a,
    const torch::Tensor& b) {
  return a.mul(b).sum(-1);
}
"""
# PyTorch makes it easy to test our C++ implementations by providing a utility
# to JIT compile C++ source into Python extensions:
import os
from torch.utils import cpp_extension
cpp_lib = cpp_extension.load_inline(
    name='cpp_lib',
    cpp_sources=batched_dot_src,
    extra_cflags=['-O3'],
    extra_include_paths=[
        # `load_inline` needs to know where to find ``pybind11`` headers.
        os.path.join(os.getenv('CONDA_PREFIX'), 'include')
    ],
    functions=['batched_dot_mul_sum_v0', 'batched_dot_mul_sum_v1']
)
# `load_inline` will create a shared object that is loaded into Python. When we collect
# instruction counts Timer will create a subprocess, so we need to re-import it. The
# import process is slightly more complicated for C extensions, but that's all we're
# doing here.
module_import_str = f"""\
# https://stackoverflow.com/questions/67631/how-to-import-a-module-given-the-full-path
import importlib.util
spec = importlib.util.spec_from_file_location("cpp_lib", {repr(cpp_lib.__file__)})
cpp_lib = importlib.util.module_from_spec(spec)
spec.loader.exec_module(cpp_lib)"""
import textwrap
def pretty_print(result):
    """Import machinery for ``cpp_lib.so`` can get repetitive to look at."""
    print(repr(result).replace(textwrap.indent(module_import_str, "    "), "    import cpp_lib"))
t_baseline = benchmark.Timer(
    stmt='batched_dot_mul_sum(x, x)',
    setup='''\
from __main__ import batched_dot_mul_sum
x = torch.randn(2, 2)''')

t0 = benchmark.Timer(
    stmt='cpp_lib.batched_dot_mul_sum_v0(x, x)',
    setup=f'''\
{module_import_str}
x = torch.randn(2, 2)''')

t1 = benchmark.Timer(
    stmt='cpp_lib.batched_dot_mul_sum_v1(x, x)',
    setup=f'''\
{module_import_str}
x = torch.randn(2, 2)''')
# Moving to C++ did indeed reduce overhead, but it's hard to tell which
# calling convention is more efficient. v1 (call with references) seems to
# be a bit faster, but it's within measurement error.
pretty_print(t_baseline.blocked_autorange())
pretty_print(t0.blocked_autorange())
pretty_print(t1.blocked_autorange())
<torch.utils.benchmark.utils.common.Measurement object at 0x7fb16935d2e8>
batched_dot_mul_sum(x, x)
setup:
from __main__ import batched_dot_mul_sum
x = torch.randn(2, 2)
6.92 us
1 measurement, 100000 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7fb16935d2e8>
cpp_lib.batched_dot_mul_sum_v0(x, x)
setup:
import cpp_lib
x = torch.randn(2, 2)
5.29 us
1 measurement, 100000 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7fb16935d2e8>
cpp_lib.batched_dot_mul_sum_v1(x, x)
setup:
import cpp_lib
x = torch.randn(2, 2)
5.22 us
1 measurement, 100000 runs , 1 thread
# Let's use ``Callgrind`` to determine which is better.
stats_v0 = t0.collect_callgrind()
stats_v1 = t1.collect_callgrind()
pretty_print(stats_v0)
pretty_print(stats_v1)
# `.as_standardized` removes file names and some path prefixes, and makes
# it easier to read the function symbols.
stats_v0 = stats_v0.as_standardized()
stats_v1 = stats_v1.as_standardized()
# `.delta` diffs the instruction counts, and `.denoise` removes several
# functions in the Python interpreter that are known to have significant
# jitter.
delta = stats_v1.delta(stats_v0).denoise()
# `.transform` is a convenience API for transforming function names. It is
# useful for increasing cancelation when ``diff-ing`` instructions, as well as
# just generally improving readability.
replacements = (
    ("???:void pybind11", "pybind11"),
    ("batched_dot_mul_sum_v0", "batched_dot_mul_sum_v1"),
    ("at::Tensor, at::Tensor", "..."),
    ("at::Tensor const&, at::Tensor const&", "..."),
    ("auto torch::detail::wrap_pybind_function_impl_", "wrap_pybind_function_impl_"),
)
for before, after in replacements:
    delta = delta.transform(lambda l: l.replace(before, after))
# We can use print options to control how much of the function to display.
torch.set_printoptions(linewidth=160)
# Once parsed, the instruction counts make clear that passing `a` and `b`
# by reference is more efficient as it skips some ``c10::TensorImpl`` bookkeeping
# for the intermediate Tensors, and also works better with ``pybind11``. This
# is consistent with our noisy wall time observations.
print(delta)
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fb0f06e7630>
cpp_lib.batched_dot_mul_sum_v0(x, x)
setup:
import cpp_lib
x = torch.randn(2, 2)
All Noisy symbols removed
Instructions: 2392671 2392671
Baseline: 4367 4367
100 runs per measurement, 1 thread
Warning: PyTorch was not built with debug symbols.
Source information may be limited. Rebuild with
REL_WITH_DEB_INFO=1 for more detailed results.
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fb10400d208>
cpp_lib.batched_dot_mul_sum_v1(x, x)
setup:
import cpp_lib
x = torch.randn(2, 2)
All Noisy symbols removed
Instructions: 2378978 2378978
Baseline: 4367 4367
100 runs per measurement, 1 thread
Warning: PyTorch was not built with debug symbols.
Source information may be limited. Rebuild with
REL_WITH_DEB_INFO=1 for more detailed results.
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.FunctionCounts object at 0x7fb1000ab358>
86 ???:0x000000000020d9e0
56 ???:0x000000000020db10
-1100 pybind11::cpp_function::initialize<wrap_pybind_function_impl_<at::Tensor ... r (&)(...), std::integer_sequence<unsigned long, 0ul, 1ul>)::{lambda(...)
-1600 ???:wrap_pybind_function_impl_<at::Tensor (&)(...), 0ul, 1ul>(at::Tensor (&)(...), std::integer_sequence<unsigned long, 0ul, 1ul>)::{lambda(...)
-5200 ???:c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_()
-5935 ???:0x000000000022c0e0
Total: -13693