ASR Inference with CUDA CTC Decoder¶
Author: Yuekai Zhang
This tutorial shows how to perform speech recognition inference using a CUDA-based CTC beam search decoder. We demonstrate this on a pretrained Zipformer model from the Next-gen Kaldi project.
Overview¶
Beam search decoding works by iteratively expanding text hypotheses (beams) with the next possible characters, keeping only the highest-scoring hypotheses at each time step. The underlying implementation uses CUDA to accelerate the whole decoding process. A mathematical formulation of the decoder can be found in the paper.
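The expand-and-prune idea can be illustrated with a toy sketch in plain Python (a simplified illustration, not the CUDA implementation; it omits CTC-specific details such as blank handling and prefix merging, and `simple_beam_search` is a hypothetical helper):

```python
import math

def simple_beam_search(log_probs, beam_size):
    """At each time step, extend every kept hypothesis with every
    candidate token, then keep only the `beam_size` highest-scoring
    hypotheses."""
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    for frame in log_probs:
        candidates = [
            (seq + (token,), score + lp)
            for seq, score in beams
            for token, lp in enumerate(frame)
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams

# Three frames over a three-token vocabulary.
log_probs = [
    [math.log(0.6), math.log(0.3), math.log(0.1)],
    [math.log(0.2), math.log(0.7), math.log(0.1)],
    [math.log(0.1), math.log(0.1), math.log(0.8)],
]
best_seq, best_score = simple_beam_search(log_probs, beam_size=2)[0]
print(best_seq)  # (0, 1, 2)
```

With `beam_size=2`, only the two best partial hypotheses survive each step; the real decoder additionally merges hypotheses that collapse to the same CTC prefix.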
Running ASR inference with a CUDA CTC beam search decoder requires the following components:
- Acoustic model: a model that predicts modeling units (BPE in this tutorial) from acoustic features
- BPE model: the byte-pair encoding (BPE) tokenizer file
Acoustic Model and Setup¶
First, we import the necessary utilities and fetch the data that we are working with.
import torch
import torchaudio
print(torch.__version__)
print(torchaudio.__version__)
2.6.0
2.6.0
import time
from pathlib import Path
import IPython
import sentencepiece as spm
from torchaudio.models.decoder import cuda_ctc_decoder
from torchaudio.utils import download_asset
We use a pretrained Zipformer model trained on the LibriSpeech dataset. The model was jointly trained with CTC and Transducer loss functions. In this tutorial, we only use the CTC head of the model.
def download_asset_external(url, key):
    path = Path(torch.hub.get_dir()) / "torchaudio" / Path(key)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        torch.hub.download_url_to_file(url, path)
    return str(path)
url_prefix = "https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-ctc-2022-12-01"
model_link = f"{url_prefix}/resolve/main/exp/cpu_jit.pt"
model_path = download_asset_external(model_link, "cuda_ctc_decoder/cpu_jit.pt")
We will load a sample from the LibriSpeech test-other dataset.
speech_file = download_asset("tutorial-assets/ctc-decoding/1688-142285-0007.wav")
waveform, sample_rate = torchaudio.load(speech_file)
assert sample_rate == 16000
IPython.display.Audio(speech_file)
The transcript corresponding to this audio file is

i really was very much afraid of showing him how much shocked i was at some parts of what he said
Files and Data for Decoder¶
Next, we load our tokens from the BPE model, which serves as the tokenizer for decoding.
Tokens¶
The tokens are the possible symbols that the acoustic model can predict, including the blank symbol in CTC. In this tutorial, it includes 500 BPE tokens. They can be passed in either as a file, where each line consists of the token corresponding to that index, or as a list of tokens, each mapping to a unique index.
# tokens
<blk>
<sos/eos>
<unk>
S
_THE
_A
T
_AND
...
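A tokens file like the one above maps line numbers to token indices. As a minimal sketch (with a made-up miniature vocabulary, not the 500-token file used in this tutorial):

```python
# Hypothetical miniature tokens file: one token per line; the line
# number is the index the acoustic model predicts for that token.
token_file_contents = "<blk>\n<sos/eos>\n<unk>\nS\n_THE\n_A\n"

tokens = token_file_contents.splitlines()
index_of = {tok: i for i, tok in enumerate(tokens)}

print(index_of["<blk>"])  # 0 -- the CTC blank is conventionally index 0
print(tokens[4])          # _THE
```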
bpe_link = f"{url_prefix}/resolve/main/data/lang_bpe_500/bpe.model"
bpe_path = download_asset_external(bpe_link, "cuda_ctc_decoder/bpe.model")
bpe_model = spm.SentencePieceProcessor()
bpe_model.load(bpe_path)
tokens = [bpe_model.id_to_piece(id) for id in range(bpe_model.get_piece_size())]
print(tokens)
['<blk>', '<sos/eos>', '<unk>', 'S', '▁THE', '▁A', 'T', '▁AND', 'ED', '▁OF', '▁TO', 'E', 'D', 'N', 'ING', '▁IN', 'Y', 'M', 'C', '▁I', 'A', 'P', '▁HE', 'R', 'O', 'L', 'RE', 'I', 'U', 'ER', '▁IT', 'LY', '▁THAT', '▁WAS', '▁', '▁S', 'AR', '▁BE', 'F', '▁C', 'IN', 'B', '▁FOR', 'OR', 'LE', "'", '▁HIS', '▁YOU', 'AL', '▁RE', 'V', '▁B', 'G', 'RI', '▁E', '▁WITH', '▁T', '▁AS', 'LL', '▁P', '▁HER', 'ST', '▁HAD', '▁SO', '▁F', 'W', 'CE', '▁IS', 'ND', '▁NOT', 'TH', '▁BUT', 'EN', '▁SHE', '▁ON', 'VE', 'ON', 'SE', '▁DE', 'UR', '▁G', 'CH', 'K', 'TER', '▁AT', 'IT', '▁ME', 'RO', 'NE', 'RA', 'ES', 'IL', 'NG', 'IC', '▁NO', '▁HIM', 'ENT', 'IR', '▁WE', 'H', '▁DO', '▁ALL', '▁HAVE', 'LO', '▁BY', '▁MY', '▁MO', '▁THIS', 'LA', '▁ST', '▁WHICH', '▁CON', '▁THEY', 'CK', 'TE', '▁SAID', '▁FROM', '▁GO', '▁WHO', '▁TH', '▁OR', '▁D', '▁W', 'VER', 'LI', '▁SE', '▁ONE', '▁CA', '▁AN', '▁LA', '▁WERE', 'EL', '▁HA', '▁MAN', '▁FA', '▁EX', 'AD', '▁SU', 'RY', '▁MI', 'AT', '▁BO', '▁WHEN', 'AN', 'THER', 'PP', 'ATION', '▁FI', '▁WOULD', '▁PRO', 'OW', 'ET', '▁O', '▁THERE', '▁HO', 'ION', '▁WHAT', '▁FE', '▁PA', 'US', 'MENT', '▁MA', 'UT', '▁OUT', '▁THEIR', '▁IF', '▁LI', '▁K', '▁WILL', '▁ARE', 'ID', '▁RO', 'DE', 'TION', '▁WA', 'PE', '▁UP', '▁SP', '▁PO', 'IGHT', '▁UN', 'RU', '▁LO', 'AS', 'OL', '▁LE', '▁BEEN', '▁SH', '▁RA', '▁SEE', 'KE', 'UL', 'TED', '▁SA', 'UN', 'UND', 'ANT', '▁NE', 'IS', '▁THEM', 'CI', 'GE', '▁COULD', '▁DIS', 'OM', 'ISH', 'HE', 'EST', '▁SOME', 'ENCE', 'ITY', 'IVE', '▁US', '▁MORE', '▁EN', 'ARD', 'ATE', '▁YOUR', '▁INTO', '▁KNOW', '▁CO', 'ANCE', '▁TIME', '▁WI', '▁YE', 'AGE', '▁NOW', 'TI', 'FF', 'ABLE', '▁VERY', '▁LIKE', 'AM', 'HI', 'Z', '▁OTHER', '▁THAN', '▁LITTLE', '▁DID', '▁LOOK', 'TY', 'ERS', '▁CAN', '▁CHA', '▁AR', 'X', 'FUL', 'UGH', '▁BA', '▁DAY', '▁ABOUT', 'TEN', 'IM', '▁ANY', '▁PRE', '▁OVER', 'IES', 'NESS', 'ME', 'BLE', '▁M', 'ROW', '▁HAS', '▁GREAT', '▁VI', 'TA', '▁AFTER', 'PER', '▁AGAIN', 'HO', 'SH', '▁UPON', '▁DI', '▁HAND', '▁COM', 'IST', 'TURE', '▁STA', '▁THEN', '▁SHOULD', '▁GA', 'OUS', 'OUR', '▁WELL', 
'▁ONLY', 'MAN', '▁GOOD', '▁TWO', '▁MAR', '▁SAY', '▁HU', 'TING', '▁OUR', 'RESS', '▁DOWN', 'IOUS', '▁BEFORE', '▁DA', '▁NA', 'QUI', '▁MADE', '▁EVERY', '▁OLD', '▁EVEN', 'IG', '▁COME', '▁GRA', '▁RI', '▁LONG', 'OT', 'SIDE', 'WARD', '▁FO', '▁WHERE', 'MO', 'LESS', '▁SC', '▁MUST', '▁NEVER', '▁HOW', '▁CAME', '▁SUCH', '▁RU', '▁TAKE', '▁WO', '▁CAR', 'UM', 'AK', '▁THINK', '▁MUCH', '▁MISTER', '▁MAY', '▁JO', '▁WAY', '▁COMP', '▁THOUGHT', '▁STO', '▁MEN', '▁BACK', '▁DON', 'J', '▁LET', '▁TRA', '▁FIRST', '▁JUST', '▁VA', '▁OWN', '▁PLA', '▁MAKE', 'ATED', '▁HIMSELF', '▁WENT', '▁PI', 'GG', 'RING', '▁DU', '▁MIGHT', '▁PART', '▁GIVE', '▁IMP', '▁BU', '▁PER', '▁PLACE', '▁HOUSE', '▁THROUGH', 'IAN', '▁SW', '▁UNDER', 'QUE', '▁AWAY', '▁LOVE', 'QUA', '▁LIFE', '▁GET', '▁WITHOUT', '▁PASS', '▁TURN', 'IGN', '▁HEAD', '▁MOST', '▁THOSE', '▁SHALL', '▁EYES', '▁COL', '▁STILL', '▁NIGHT', '▁NOTHING', 'ITION', 'HA', '▁TELL', '▁WORK', '▁LAST', '▁NEW', '▁FACE', '▁HI', '▁WORD', '▁FOUND', '▁COUNT', '▁OB', '▁WHILE', '▁SHA', '▁MEAN', '▁SAW', '▁PEOPLE', '▁FRIEND', '▁THREE', '▁ROOM', '▁SAME', '▁THOUGH', '▁RIGHT', '▁CHILD', '▁FATHER', '▁ANOTHER', '▁HEART', '▁WANT', '▁TOOK', 'OOK', '▁LIGHT', '▁MISSUS', '▁OPEN', '▁JU', '▁ASKED', 'PORT', '▁LEFT', '▁JA', '▁WORLD', '▁HOME', '▁WHY', '▁ALWAYS', '▁ANSWER', '▁SEEMED', '▁SOMETHING', '▁GIRL', '▁BECAUSE', '▁NAME', '▁TOLD', '▁NI', '▁HIGH', 'IZE', '▁WOMAN', '▁FOLLOW', '▁RETURN', '▁KNEW', '▁EACH', '▁KIND', '▁JE', '▁ACT', '▁LU', '▁CERTAIN', '▁YEARS', '▁QUITE', '▁APPEAR', '▁BETTER', '▁HALF', '▁PRESENT', '▁PRINCE', 'SHIP', '▁ALSO', '▁BEGAN', '▁HAVING', '▁ENOUGH', '▁PERSON', '▁LADY', '▁WHITE', '▁COURSE', '▁VOICE', '▁SPEAK', '▁POWER', '▁MORNING', '▁BETWEEN', '▁AMONG', '▁KEEP', '▁WALK', '▁MATTER', '▁TEA', '▁BELIEVE', '▁SMALL', '▁TALK', '▁FELT', '▁HORSE', '▁MYSELF', '▁SIX', '▁HOWEVER', '▁FULL', '▁HERSELF', '▁POINT', '▁STOOD', '▁HUNDRED', '▁ALMOST', '▁SINCE', '▁LARGE', '▁LEAVE', '▁PERHAPS', '▁DARK', '▁SUDDEN', '▁REPLIED', '▁ANYTHING', '▁WONDER', '▁UNTIL', 'Q']
Construct CUDA Decoder¶
In this tutorial, we will construct a CUDA beam search decoder. The decoder can be constructed using the factory function cuda_ctc_decoder().
cuda_decoder = cuda_ctc_decoder(tokens, nbest=10, beam_size=10, blank_skip_threshold=0.95)
Run Inference¶
Now that we have the data, acoustic model, and decoder, we can perform inference. The output of the beam search decoder is of type CUCTCHypothesis, consisting of the predicted token IDs, the corresponding words (symbols for the token IDs), and the hypothesis score. Recall that the transcript corresponding to the waveform is
actual_transcript = "i really was very much afraid of showing him how much shocked i was at some parts of what he said"
actual_transcript = actual_transcript.split()
device = torch.device("cuda", 0)
acoustic_model = torch.jit.load(model_path)
acoustic_model.to(device)
acoustic_model.eval()
waveform = waveform.to(device)
feat = torchaudio.compliance.kaldi.fbank(waveform, num_mel_bins=80, snip_edges=False)
feat = feat.unsqueeze(0)
feat_lens = torch.tensor(feat.size(1), device=device).unsqueeze(0)
encoder_out, encoder_out_lens = acoustic_model.encoder(feat, feat_lens)
nnet_output = acoustic_model.ctc_output(encoder_out)
log_prob = torch.nn.functional.log_softmax(nnet_output, -1)
print(f"The shape of log_prob: {log_prob.shape}, the shape of encoder_out_lens: {encoder_out_lens.shape}")
The shape of log_prob: torch.Size([1, 175, 500]), the shape of encoder_out_lens: torch.Size([1])
The CUDA CTC decoder gives the following result.
results = cuda_decoder(log_prob, encoder_out_lens.to(torch.int32))
beam_search_transcript = bpe_model.decode(results[0][0].tokens).lower()
beam_search_wer = torchaudio.functional.edit_distance(actual_transcript, beam_search_transcript.split()) / len(
    actual_transcript
)
print(f"Transcript: {beam_search_transcript}")
print(f"WER: {beam_search_wer}")
Transcript: i really was very much afraid of showing him how much shocked i was at some parts of what he said
WER: 0.0
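Word error rate here is the word-level edit (Levenshtein) distance divided by the reference length. A self-contained sketch of the quantity torchaudio.functional.edit_distance computes (my own minimal implementation, for illustration only):

```python
def edit_distance(ref, hyp):
    """Space-optimized dynamic-programming Levenshtein distance
    over two lists of words."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(dp[j] + 1,        # deletion
                      dp[j - 1] + 1,    # insertion
                      prev + (r != h))  # substitution / match
            prev, dp[j] = dp[j], cur
    return dp[-1]

ref = "i really was very much afraid".split()
hyp = "i really was very much afraid".split()
print(edit_distance(ref, hyp) / len(ref))  # 0.0 -- perfect match
```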
Beam Search Decoder Parameters¶
In this section, we go a little deeper into some of the different parameters and tradeoffs. For the full list of customizable parameters, please refer to the documentation.
Helper Function¶
def print_decoded(cuda_decoder, bpe_model, log_prob, encoder_out_lens, param, param_value):
    start_time = time.monotonic()
    results = cuda_decoder(log_prob, encoder_out_lens.to(torch.int32))
    decode_time = time.monotonic() - start_time
    transcript = bpe_model.decode(results[0][0].tokens).lower()
    score = results[0][0].score
    print(f"{param} {param_value:<3}: {transcript} (score: {score:.2f}; {decode_time:.4f} secs)")
nbest¶
This parameter indicates the number of best hypotheses to return. For instance, by setting nbest=10 when constructing the beam search decoder earlier, we can now access the hypotheses with the top 10 scores.
for i in range(10):
    transcript = bpe_model.decode(results[0][i].tokens).lower()
    score = results[0][i].score
    print(f"{transcript} (score: {score})")
i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.20280733704566956)
i really was very much afraid of showing him how much shocked i was at some part of what he said (score: -1.7408883571624756)
i really was very much afraid of sheowing him how much shocked i was at some parts of what he said (score: -6.67951774597168)
i reallyly very much afraid of showing him how much shocked i was at some parts of what he said (score: -7.597038745880127)
i really was very much afraid of sheowing him how much shocked i was at some part of what he said (score: -8.224080085754395)
i really was very much afraid of shwing him how much shocked i was at some parts of what he said (score: -8.439373970031738)
i really was very much afraid of showing him how much shocked i was in some parts of what he said (score: -8.781461715698242)
i really was very much afraid of showing him how much shocked i was at some parts of what said (score: -8.883706092834473)
i really was very much afraid of showing him how much shocked i was at some partes of what he said (score: -8.999059677124023)
i really was very much afraid of showing him how much shocked i was at some parts of what he say (score: -9.138861656188965)
Beam Size¶
The beam_size parameter determines the maximum number of best hypotheses to hold after each decoding step. Using larger beam sizes allows exploring a larger range of possible hypotheses, which can produce hypotheses with higher scores, but beyond a certain point it provides no additional gain. We recommend setting beam_size=10 for the CUDA beam search decoder.
In the example below, we see improvement in decoding quality as we increase the beam size from 1 to 3, but notice how a beam size of 3 already gives the same output as a beam size of 10.
beam_sizes = [1, 2, 3, 10]
for beam_size in beam_sizes:
    beam_search_decoder = cuda_ctc_decoder(
        tokens,
        nbest=1,
        beam_size=beam_size,
        blank_skip_threshold=0.95,
    )
    print_decoded(beam_search_decoder, bpe_model, log_prob, encoder_out_lens, "beam size", beam_size)
beam size 1 : i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -1.35; 0.0010 secs)
beam size 2 : i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.21; 0.0009 secs)
beam size 3 : i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.20; 0.0009 secs)
beam size 10 : i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.20; 0.0010 secs)
Blank Skip Threshold¶
The blank_skip_threshold parameter is used to prune frames that have a high blank probability. Pruning these frames with a well-chosen blank_skip_threshold can speed up decoding substantially without degrading accuracy. Per the rules of CTC, we keep at least one blank frame between two consecutive non-blank frames to avoid mistakenly merging two consecutive identical symbols. We recommend setting blank_skip_threshold=0.95 for the CUDA beam search decoder.
blank_skip_probs = [0.25, 0.95, 1.0]
for blank_skip_prob in blank_skip_probs:
    beam_search_decoder = cuda_ctc_decoder(
        tokens,
        nbest=10,
        beam_size=10,
        blank_skip_threshold=blank_skip_prob,
    )
    print_decoded(beam_search_decoder, bpe_model, log_prob, encoder_out_lens, "blank_skip_threshold", blank_skip_prob)
del cuda_decoder
blank_skip_threshold 0.25: i really was very much afraid of showing him how much shocked i was at some part of what he said (score: -0.01; 0.0009 secs)
blank_skip_threshold 0.95: i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.20; 0.0010 secs)
blank_skip_threshold 1.0: i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.21; 0.0043 secs)
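The frame-pruning rule described above can be sketched in plain Python (an illustration of the idea only, not the actual CUDA kernel; frames_to_keep is a hypothetical helper):

```python
def frames_to_keep(blank_probs, threshold):
    """Keep every non-blank frame; a frame whose blank probability
    exceeds `threshold` is skipped, except that the first blank frame
    after a kept non-blank frame is retained as a separator, so that
    two consecutive identical symbols are never merged by mistake."""
    keep = []
    prev_kept_was_nonblank = False
    for i, p in enumerate(blank_probs):
        if p <= threshold:            # non-blank frame: always keep
            keep.append(i)
            prev_kept_was_nonblank = True
        elif prev_kept_was_nonblank:  # first blank after a non-blank
            keep.append(i)
            prev_kept_was_nonblank = False
    return keep

# Frames 0 and 3 are confidently blank and get pruned; frames 2 and 5
# survive as separators after the non-blank frames 1 and 4.
print(frames_to_keep([0.99, 0.1, 0.99, 0.99, 0.2, 0.99], 0.95))  # [1, 2, 4, 5]
```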
Benchmark with Flashlight CPU Decoder¶
We benchmark the throughput and accuracy of the CUDA decoder against the CPU decoder using the LibriSpeech test-other set. To reproduce the benchmark results below, you may refer here.
| Decoder | Setting | WER (%) | N-Best Oracle WER (%) | Decoder Cost Time (seconds) |
|---|---|---|---|---|
| CUDA decoder | blank_skip_threshold 0.95 | 5.81 | 4.11 | 2.57 |
| CUDA decoder | blank_skip_threshold 1.0 (no frame-skip) | 5.81 | 4.09 | 6.24 |
| CPU decoder | beam_size_token 10 | 5.86 | 4.30 | 28.61 |
| CPU decoder | beam_size_token 500 | 5.86 | 4.30 | 791.80 |
From the table above, the CUDA decoder delivers a slight improvement in WER and a significant increase in throughput.
Total running time of the script: (0 minutes 2.023 seconds)