ASR Inference with CUDA CTC Decoder¶
Author: Yuekai Zhang
This tutorial shows how to perform speech recognition inference using a CUDA-based CTC beam search decoder. We demonstrate this on a pretrained Zipformer model from the Next-gen Kaldi project.
Overview¶
Beam search decoding works by iteratively expanding text hypotheses (beams) with the next possible characters, keeping only the highest-scoring hypotheses at each time step. The underlying implementation uses CUDA to accelerate the whole decoding process. A mathematical formulation of the decoder can be found in the paper.
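The expand-and-prune idea can be illustrated with a toy sketch in plain Python (a simplified illustration, not the CUDA implementation; it omits CTC-specific details such as blank handling and prefix merging, and `simple_beam_search` is a hypothetical helper):

```python
import math

def simple_beam_search(log_probs, beam_size):
    """At each time step, extend every kept hypothesis with every
    candidate token, then keep only the `beam_size` highest-scoring
    hypotheses."""
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    for frame in log_probs:
        candidates = [
            (seq + (token,), score + lp)
            for seq, score in beams
            for token, lp in enumerate(frame)
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams

# Three frames over a three-token vocabulary.
log_probs = [
    [math.log(0.6), math.log(0.3), math.log(0.1)],
    [math.log(0.2), math.log(0.7), math.log(0.1)],
    [math.log(0.1), math.log(0.1), math.log(0.8)],
]
best_seq, best_score = simple_beam_search(log_probs, beam_size=2)[0]
print(best_seq)  # (0, 1, 2)
```

With `beam_size=2`, only the two best partial hypotheses survive each step; the real decoder additionally merges hypotheses that collapse to the same CTC prefix.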
Running ASR inference with a CUDA CTC beam search decoder requires the following components:
- Acoustic model: a model that predicts modeling units (BPE in this tutorial) from acoustic features
- BPE model: the byte-pair encoding (BPE) tokenizer file
Acoustic Model and Setup¶
First, we import the necessary utilities and fetch the data that we are working with.
import torch
import torchaudio
print(torch.__version__)
print(torchaudio.__version__)
2.6.0
2.6.0
import time
from pathlib import Path
import IPython
import sentencepiece as spm
from torchaudio.models.decoder import cuda_ctc_decoder
from torchaudio.utils import download_asset
We use a pretrained Zipformer model trained on the LibriSpeech dataset. The model was jointly trained with CTC and Transducer loss functions. In this tutorial, we only use the CTC head of the model.
def download_asset_external(url, key):
    path = Path(torch.hub.get_dir()) / "torchaudio" / Path(key)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        torch.hub.download_url_to_file(url, path)
    return str(path)
url_prefix = "https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-ctc-2022-12-01"
model_link = f"{url_prefix}/resolve/main/exp/cpu_jit.pt"
model_path = download_asset_external(model_link, "cuda_ctc_decoder/cpu_jit.pt")
We will load a sample from the LibriSpeech test-other dataset.
speech_file = download_asset("tutorial-assets/ctc-decoding/1688-142285-0007.wav")
waveform, sample_rate = torchaudio.load(speech_file)
assert sample_rate == 16000
IPython.display.Audio(speech_file)
The transcript corresponding to this audio file is

i really was very much afraid of showing him how much shocked i was at some parts of what he said
Files and Data for Decoder¶
Next, we load our tokens from the BPE model, which serves as the tokenizer for decoding.
Tokens¶
The tokens are the possible symbols that the acoustic model can predict, including the blank symbol in CTC. In this tutorial, it includes 500 BPE tokens. They can be passed in either as a file, where each line consists of the token corresponding to that index, or as a list of tokens, each mapping to a unique index.
# tokens
<blk>
<sos/eos>
<unk>
S
_THE
_A
T
_AND
...
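A tokens file like the one above maps line numbers to token indices. As a minimal sketch (with a made-up miniature vocabulary, not the 500-token file used in this tutorial):

```python
# Hypothetical miniature tokens file: one token per line; the line
# number is the index the acoustic model predicts for that token.
token_file_contents = "<blk>\n<sos/eos>\n<unk>\nS\n_THE\n_A\n"

tokens = token_file_contents.splitlines()
index_of = {tok: i for i, tok in enumerate(tokens)}

print(index_of["<blk>"])  # 0 -- the CTC blank is conventionally index 0
print(tokens[4])          # _THE
```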
bpe_link = f"{url_prefix}/resolve/main/data/lang_bpe_500/bpe.model"
bpe_path = download_asset_external(bpe_link, "cuda_ctc_decoder/bpe.model")
bpe_model = spm.SentencePieceProcessor()
bpe_model.load(bpe_path)
tokens = [bpe_model.id_to_piece(id) for id in range(bpe_model.get_piece_size())]
print(tokens)
['<blk>', '<sos/eos>', '<unk>', 'S', '▁THE', '▁A', 'T', '▁AND', 'ED', '▁OF', '▁TO', 'E', 'D', 'N', 'ING', '▁IN', 'Y', 'M', 'C', '▁I', 'A', 'P', '▁HE', 'R', 'O', 'L', 'RE', 'I', 'U', 'ER', '▁IT', 'LY', '▁THAT', '▁WAS', '▁', '▁S', 'AR', '▁BE', 'F', '▁C', 'IN', 'B', '▁FOR', 'OR', 'LE', "'", '▁HIS', '▁YOU', 'AL', '▁RE', 'V', '▁B', 'G', 'RI', '▁E', '▁WITH', '▁T', '▁AS', 'LL', '▁P', '▁HER', 'ST', '▁HAD', '▁SO', '▁F', 'W', 'CE', '▁IS', 'ND', '▁NOT', 'TH', '▁BUT', 'EN', '▁SHE', '▁ON', 'VE', 'ON', 'SE', '▁DE', 'UR', '▁G', 'CH', 'K', 'TER', '▁AT', 'IT', '▁ME', 'RO', 'NE', 'RA', 'ES', 'IL', 'NG', 'IC', '▁NO', '▁HIM', 'ENT', 'IR', '▁WE', 'H', '▁DO', '▁ALL', '▁HAVE', 'LO', '▁BY', '▁MY', '▁MO', '▁THIS', 'LA', '▁ST', '▁WHICH', '▁CON', '▁THEY', 'CK', 'TE', '▁SAID', '▁FROM', '▁GO', '▁WHO', '▁TH', '▁OR', '▁D', '▁W', 'VER', 'LI', '▁SE', '▁ONE', '▁CA', '▁AN', '▁LA', '▁WERE', 'EL', '▁HA', '▁MAN', '▁FA', '▁EX', 'AD', '▁SU', 'RY', '▁MI', 'AT', '▁BO', '▁WHEN', 'AN', 'THER', 'PP', 'ATION', '▁FI', '▁WOULD', '▁PRO', 'OW', 'ET', '▁O', '▁THERE', '▁HO', 'ION', '▁WHAT', '▁FE', '▁PA', 'US', 'MENT', '▁MA', 'UT', '▁OUT', '▁THEIR', '▁IF', '▁LI', '▁K', '▁WILL', '▁ARE', 'ID', '▁RO', 'DE', 'TION', '▁WA', 'PE', '▁UP', '▁SP', '▁PO', 'IGHT', '▁UN', 'RU', '▁LO', 'AS', 'OL', '▁LE', '▁BEEN', '▁SH', '▁RA', '▁SEE', 'KE', 'UL', 'TED', '▁SA', 'UN', 'UND', 'ANT', '▁NE', 'IS', '▁THEM', 'CI', 'GE', '▁COULD', '▁DIS', 'OM', 'ISH', 'HE', 'EST', '▁SOME', 'ENCE', 'ITY', 'IVE', '▁US', '▁MORE', '▁EN', 'ARD', 'ATE', '▁YOUR', '▁INTO', '▁KNOW', '▁CO', 'ANCE', '▁TIME', '▁WI', '▁YE', 'AGE', '▁NOW', 'TI', 'FF', 'ABLE', '▁VERY', '▁LIKE', 'AM', 'HI', 'Z', '▁OTHER', '▁THAN', '▁LITTLE', '▁DID', '▁LOOK', 'TY', 'ERS', '▁CAN', '▁CHA', '▁AR', 'X', 'FUL', 'UGH', '▁BA', '▁DAY', '▁ABOUT', 'TEN', 'IM', '▁ANY', '▁PRE', '▁OVER', 'IES', 'NESS', 'ME', 'BLE', '▁M', 'ROW', '▁HAS', '▁GREAT', '▁VI', 'TA', '▁AFTER', 'PER', '▁AGAIN', 'HO', 'SH', '▁UPON', '▁DI', '▁HAND', '▁COM', 'IST', 'TURE', '▁STA', '▁THEN', '▁SHOULD', '▁GA', 'OUS', 'OUR', '▁WELL', 
'▁ONLY', 'MAN', '▁GOOD', '▁TWO', '▁MAR', '▁SAY', '▁HU', 'TING', '▁OUR', 'RESS', '▁DOWN', 'IOUS', '▁BEFORE', '▁DA', '▁NA', 'QUI', '▁MADE', '▁EVERY', '▁OLD', '▁EVEN', 'IG', '▁COME', '▁GRA', '▁RI', '▁LONG', 'OT', 'SIDE', 'WARD', '▁FO', '▁WHERE', 'MO', 'LESS', '▁SC', '▁MUST', '▁NEVER', '▁HOW', '▁CAME', '▁SUCH', '▁RU', '▁TAKE', '▁WO', '▁CAR', 'UM', 'AK', '▁THINK', '▁MUCH', '▁MISTER', '▁MAY', '▁JO', '▁WAY', '▁COMP', '▁THOUGHT', '▁STO', '▁MEN', '▁BACK', '▁DON', 'J', '▁LET', '▁TRA', '▁FIRST', '▁JUST', '▁VA', '▁OWN', '▁PLA', '▁MAKE', 'ATED', '▁HIMSELF', '▁WENT', '▁PI', 'GG', 'RING', '▁DU', '▁MIGHT', '▁PART', '▁GIVE', '▁IMP', '▁BU', '▁PER', '▁PLACE', '▁HOUSE', '▁THROUGH', 'IAN', '▁SW', '▁UNDER', 'QUE', '▁AWAY', '▁LOVE', 'QUA', '▁LIFE', '▁GET', '▁WITHOUT', '▁PASS', '▁TURN', 'IGN', '▁HEAD', '▁MOST', '▁THOSE', '▁SHALL', '▁EYES', '▁COL', '▁STILL', '▁NIGHT', '▁NOTHING', 'ITION', 'HA', '▁TELL', '▁WORK', '▁LAST', '▁NEW', '▁FACE', '▁HI', '▁WORD', '▁FOUND', '▁COUNT', '▁OB', '▁WHILE', '▁SHA', '▁MEAN', '▁SAW', '▁PEOPLE', '▁FRIEND', '▁THREE', '▁ROOM', '▁SAME', '▁THOUGH', '▁RIGHT', '▁CHILD', '▁FATHER', '▁ANOTHER', '▁HEART', '▁WANT', '▁TOOK', 'OOK', '▁LIGHT', '▁MISSUS', '▁OPEN', '▁JU', '▁ASKED', 'PORT', '▁LEFT', '▁JA', '▁WORLD', '▁HOME', '▁WHY', '▁ALWAYS', '▁ANSWER', '▁SEEMED', '▁SOMETHING', '▁GIRL', '▁BECAUSE', '▁NAME', '▁TOLD', '▁NI', '▁HIGH', 'IZE', '▁WOMAN', '▁FOLLOW', '▁RETURN', '▁KNEW', '▁EACH', '▁KIND', '▁JE', '▁ACT', '▁LU', '▁CERTAIN', '▁YEARS', '▁QUITE', '▁APPEAR', '▁BETTER', '▁HALF', '▁PRESENT', '▁PRINCE', 'SHIP', '▁ALSO', '▁BEGAN', '▁HAVING', '▁ENOUGH', '▁PERSON', '▁LADY', '▁WHITE', '▁COURSE', '▁VOICE', '▁SPEAK', '▁POWER', '▁MORNING', '▁BETWEEN', '▁AMONG', '▁KEEP', '▁WALK', '▁MATTER', '▁TEA', '▁BELIEVE', '▁SMALL', '▁TALK', '▁FELT', '▁HORSE', '▁MYSELF', '▁SIX', '▁HOWEVER', '▁FULL', '▁HERSELF', '▁POINT', '▁STOOD', '▁HUNDRED', '▁ALMOST', '▁SINCE', '▁LARGE', '▁LEAVE', '▁PERHAPS', '▁DARK', '▁SUDDEN', '▁REPLIED', '▁ANYTHING', '▁WONDER', '▁UNTIL', 'Q']
Construct CUDA Decoder¶
In this tutorial, we will construct a CUDA beam search decoder. The decoder can be constructed using the factory function cuda_ctc_decoder().
cuda_decoder = cuda_ctc_decoder(tokens, nbest=10, beam_size=10, blank_skip_threshold=0.95)
Run Inference¶
Now that we have the data, acoustic model, and decoder, we can perform inference. The output of the beam search decoder is of type CUCTCHypothesis, consisting of the predicted token IDs, the corresponding words (symbols for the token IDs), and the hypothesis score. Recall that the transcript corresponding to the waveform is
actual_transcript = "i really was very much afraid of showing him how much shocked i was at some parts of what he said"
actual_transcript = actual_transcript.split()
device = torch.device("cuda", 0)
acoustic_model = torch.jit.load(model_path)
acoustic_model.to(device)
acoustic_model.eval()
waveform = waveform.to(device)
feat = torchaudio.compliance.kaldi.fbank(waveform, num_mel_bins=80, snip_edges=False)
feat = feat.unsqueeze(0)
feat_lens = torch.tensor(feat.size(1), device=device).unsqueeze(0)
encoder_out, encoder_out_lens = acoustic_model.encoder(feat, feat_lens)
nnet_output = acoustic_model.ctc_output(encoder_out)
log_prob = torch.nn.functional.log_softmax(nnet_output, -1)
print(f"The shape of log_prob: {log_prob.shape}, the shape of encoder_out_lens: {encoder_out_lens.shape}")
The shape of log_prob: torch.Size([1, 175, 500]), the shape of encoder_out_lens: torch.Size([1])
The CUDA CTC decoder gives the following result.
results = cuda_decoder(log_prob, encoder_out_lens.to(torch.int32))
beam_search_transcript = bpe_model.decode(results[0][0].tokens).lower()
beam_search_wer = torchaudio.functional.edit_distance(actual_transcript, beam_search_transcript.split()) / len(
    actual_transcript
)
print(f"Transcript: {beam_search_transcript}")
print(f"WER: {beam_search_wer}")
Transcript: i really was very much afraid of showing him how much shocked i was at some parts of what he said
WER: 0.0
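Word error rate here is the word-level edit (Levenshtein) distance divided by the reference length. A self-contained sketch of the quantity torchaudio.functional.edit_distance computes (my own minimal implementation, for illustration only):

```python
def edit_distance(ref, hyp):
    """Space-optimized dynamic-programming Levenshtein distance
    over two lists of words."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(dp[j] + 1,        # deletion
                      dp[j - 1] + 1,    # insertion
                      prev + (r != h))  # substitution / match
            prev, dp[j] = dp[j], cur
    return dp[-1]

ref = "i really was very much afraid".split()
hyp = "i really was very much afraid".split()
print(edit_distance(ref, hyp) / len(ref))  # 0.0 -- perfect match
```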
Beam Search Decoder Parameters¶
In this section, we go a little deeper into some of the different parameters and tradeoffs. For the full list of customizable parameters, please refer to the documentation.
Helper Function¶
def print_decoded(cuda_decoder, bpe_model, log_prob, encoder_out_lens, param, param_value):
    start_time = time.monotonic()
    results = cuda_decoder(log_prob, encoder_out_lens.to(torch.int32))
    decode_time = time.monotonic() - start_time
    transcript = bpe_model.decode(results[0][0].tokens).lower()
    score = results[0][0].score
    print(f"{param} {param_value:<3}: {transcript} (score: {score:.2f}; {decode_time:.4f} secs)")
nbest¶
This parameter indicates the number of best hypotheses to return. For instance, by setting nbest=10 when constructing the beam search decoder earlier, we can now access the hypotheses with the top 10 scores.
for i in range(10):
    transcript = bpe_model.decode(results[0][i].tokens).lower()
    score = results[0][i].score
    print(f"{transcript} (score: {score})")
i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.20280733704566956)
i really was very much afraid of showing him how much shocked i was at some part of what he said (score: -1.7408883571624756)
i really was very much afraid of sheowing him how much shocked i was at some parts of what he said (score: -6.67951774597168)
i reallyly very much afraid of showing him how much shocked i was at some parts of what he said (score: -7.597038745880127)
i really was very much afraid of sheowing him how much shocked i was at some part of what he said (score: -8.224080085754395)
i really was very much afraid of shwing him how much shocked i was at some parts of what he said (score: -8.439373970031738)
i really was very much afraid of showing him how much shocked i was in some parts of what he said (score: -8.781461715698242)
i really was very much afraid of showing him how much shocked i was at some parts of what said (score: -8.883706092834473)
i really was very much afraid of showing him how much shocked i was at some partes of what he said (score: -8.999059677124023)
i really was very much afraid of showing him how much shocked i was at some parts of what he say (score: -9.138861656188965)
Beam Size¶
The beam_size parameter determines the maximum number of best hypotheses to hold after each decoding step. Using larger beam sizes allows exploring a larger range of possible hypotheses, which can produce hypotheses with higher scores, but beyond a certain point it provides no additional gain. We recommend setting beam_size=10 for the CUDA beam search decoder.
In the example below, we see improvement in decoding quality as we increase the beam size from 1 to 3, but notice how a beam size of 3 already gives the same output as a beam size of 10.
beam_sizes = [1, 2, 3, 10]
for beam_size in beam_sizes:
    beam_search_decoder = cuda_ctc_decoder(
        tokens,
        nbest=1,
        beam_size=beam_size,
        blank_skip_threshold=0.95,
    )
    print_decoded(beam_search_decoder, bpe_model, log_prob, encoder_out_lens, "beam size", beam_size)
beam size 1 : i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -1.35; 0.0010 secs)
beam size 2 : i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.21; 0.0009 secs)
beam size 3 : i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.20; 0.0009 secs)
beam size 10 : i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.20; 0.0010 secs)
Blank Skip Threshold¶
The blank_skip_threshold parameter is used to prune frames that have a high blank probability. Pruning these frames with a well-chosen blank_skip_threshold can speed up decoding substantially without degrading accuracy. Per the rules of CTC, we keep at least one blank frame between two consecutive non-blank frames to avoid mistakenly merging two consecutive identical symbols. We recommend setting blank_skip_threshold=0.95 for the CUDA beam search decoder.
blank_skip_probs = [0.25, 0.95, 1.0]
for blank_skip_prob in blank_skip_probs:
    beam_search_decoder = cuda_ctc_decoder(
        tokens,
        nbest=10,
        beam_size=10,
        blank_skip_threshold=blank_skip_prob,
    )
    print_decoded(beam_search_decoder, bpe_model, log_prob, encoder_out_lens, "blank_skip_threshold", blank_skip_prob)
del cuda_decoder
blank_skip_threshold 0.25: i really was very much afraid of showing him how much shocked i was at some part of what he said (score: -0.01; 0.0009 secs)
blank_skip_threshold 0.95: i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.20; 0.0010 secs)
blank_skip_threshold 1.0: i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.21; 0.0043 secs)
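The frame-pruning rule described above can be sketched in plain Python (an illustration of the idea only, not the actual CUDA kernel; frames_to_keep is a hypothetical helper):

```python
def frames_to_keep(blank_probs, threshold):
    """Keep every non-blank frame; a frame whose blank probability
    exceeds `threshold` is skipped, except that the first blank frame
    after a kept non-blank frame is retained as a separator, so that
    two consecutive identical symbols are never merged by mistake."""
    keep = []
    prev_kept_was_nonblank = False
    for i, p in enumerate(blank_probs):
        if p <= threshold:            # non-blank frame: always keep
            keep.append(i)
            prev_kept_was_nonblank = True
        elif prev_kept_was_nonblank:  # first blank after a non-blank
            keep.append(i)
            prev_kept_was_nonblank = False
    return keep

# Frames 0 and 3 are confidently blank and get pruned; frames 2 and 5
# survive as separators after the non-blank frames 1 and 4.
print(frames_to_keep([0.99, 0.1, 0.99, 0.99, 0.2, 0.99], 0.95))  # [1, 2, 4, 5]
```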
Benchmark with Flashlight CPU Decoder¶
We benchmark the throughput and accuracy of the CUDA decoder against the CPU decoder using the LibriSpeech test-other set. To reproduce the benchmark results below, you may refer here.
| Decoder | Setting | WER (%) | N-Best Oracle WER (%) | Decoder Cost Time (seconds) |
|---|---|---|---|---|
| CUDA decoder | blank_skip_threshold 0.95 | 5.81 | 4.11 | 2.57 |
| CUDA decoder | blank_skip_threshold 1.0 (no frame-skip) | 5.81 | 4.09 | 6.24 |
| CPU decoder | beam_size_token 10 | 5.86 | 4.30 | 28.61 |
| CPU decoder | beam_size_token 500 | 5.86 | 4.30 | 791.80 |
From the table above, the CUDA decoder delivers a slight improvement in WER and a significant increase in throughput.
Total running time of the script: (0 minutes 2.023 seconds)