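Forced alignment for multilingual data
Authors: Xiaohui Zhang, Moto Hira.
This tutorial shows how to align transcripts to speech for non-English languages.
The process of aligning non-English (normalized) transcripts is identical to aligning English (normalized) transcripts, and the process for English is covered in detail in the CTC forced alignment tutorial. In this tutorial, we use TorchAudio's high-level API, torchaudio.pipelines.Wav2Vec2FABundle, which packages the pre-trained model, tokenizer and aligner, to perform forced alignment with less code.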
import torch
import torchaudio
print(torch.__version__)
print(torchaudio.__version__)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
2.6.0
2.6.0
cuda
from typing import List
import IPython
import matplotlib.pyplot as plt
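Creating the pipeline
First, we instantiate the model and the pre/post-processing pipelines.
The following diagram illustrates the alignment process.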
[Figure: https://download.pytorch.org/torchaudio/doc-assets/pipelines-wav2vec2fabundle.png]
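The waveform is passed to the acoustic model, which produces a sequence of probability distributions over tokens. The transcript is passed to the tokenizer, which converts it to a sequence of tokens. The aligner takes the results from the acoustic model and the tokenizer and generates a timestamp for each token.
Note
This process expects the input transcript to be already normalized. Normalization, which includes romanization for non-English languages, is language-dependent, so it is not covered in depth in this tutorial, but we will look at it briefly.
The acoustic model and the tokenizer must use the same set of tokens. To make it easy to create matching processors, Wav2Vec2FABundle associates a pre-trained acoustic model with a tokenizer. torchaudio.pipelines.MMS_FA is one such instance.
The following code instantiates a pre-trained acoustic model, a tokenizer that uses the same set of tokens as the model, and an aligner.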
from torchaudio.pipelines import MMS_FA as bundle
model = bundle.get_model()
model.to(device)
tokenizer = bundle.get_tokenizer()
aligner = bundle.get_aligner()
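Note
The model instantiated by MMS_FA's get_model() method includes the feature dimension for the <star> token by default. You can disable this by passing with_star=False.
The acoustic model of MMS_FA was created and open-sourced as part of the research project Scaling Speech Technology to 1,000+ Languages. It was trained on 23,000 hours of audio from more than 1,100 languages.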
The tokenizer simply maps normalized characters to integers. You can inspect the mapping as follows:
print(bundle.get_dict())
{'-': 0, 'a': 1, 'i': 2, 'e': 3, 'n': 4, 'o': 5, 'u': 6, 't': 7, 's': 8, 'r': 9, 'm': 10, 'k': 11, 'l': 12, 'd': 13, 'g': 14, 'h': 15, 'y': 16, 'b': 17, 'p': 18, 'w': 19, 'c': 20, 'v': 21, 'j': 22, 'z': 23, 'f': 24, "'": 25, 'q': 26, 'x': 27, '*': 28}
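The lookup that the tokenizer performs can be sketched in plain Python. This is a minimal illustration that assumes a simple per-character dictionary lookup; `SYMBOLS` and `DICTIONARY` are rebuilt by hand from the mapping printed above, and `tokenize` is a hypothetical helper, not the bundle's tokenizer.

```python
from typing import Dict, List

# Characters listed in index order, copied from the mapping printed above.
# Index 0 is '-' and index 28 is '*'.
SYMBOLS = "-aienoutsrmkldghybpwcvjzf'qx*"
DICTIONARY: Dict[str, int] = {char: index for index, char in enumerate(SYMBOLS)}


def tokenize(transcript: List[str]) -> List[List[int]]:
    """Map each word to a list of integer IDs, one ID per character.

    Hypothetical helper for illustration. A character that is missing from
    DICTIONARY raises KeyError, which is why text must be normalized first.
    """
    return [[DICTIONARY[char] for char in word] for word in transcript]


tokens = tokenize("aber seit".split())
print(tokens)  # [[1, 17, 3, 9], [8, 3, 2, 7]]
```

Each word becomes its own list of IDs, so the word boundaries of the transcript are preserved in the token structure.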
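The aligner internally uses torchaudio.functional.forced_align() and torchaudio.functional.merge_tokens() to infer the timestamps of the input tokens.
The details of the underlying mechanism are covered in the CTC forced alignment API tutorial, so please refer to it.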
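To give a rough idea of the merging step, the sketch below collapses a frame-level label sequence into token spans. It is a plain-Python illustration of the general idea and not torchaudio's implementation; the `Span` class, the `merge_repeated` helper, and the example path and scores are all hypothetical, and ID 0 is assumed to be the blank token.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass
class Span:
    token: int    # token ID
    start: int    # first frame of the span (inclusive)
    end: int      # frame after the last frame of the span (exclusive)
    score: float  # mean of the frame-level scores inside the span


def merge_repeated(path: List[int], scores: List[float], blank: int = 0) -> List[Span]:
    """Collapse runs of identical frame labels into spans and drop blank runs."""
    spans = []
    frame = 0
    for token, run in groupby(path):
        length = len(list(run))
        if token != blank:
            mean_score = sum(scores[frame:frame + length]) / length
            spans.append(Span(token, frame, frame + length, mean_score))
        frame += length
    return spans


# Hypothetical frame-level alignment: one label and one score per frame.
path = [0, 0, 1, 1, 0, 2, 2, 2, 0]
scores = [0.9, 0.8, 1.0, 0.5, 0.7, 0.6, 0.9, 0.9, 1.0]
spans = merge_repeated(path, scores)
for span in spans:
    print(span)
```

The span boundaries are frame indices of the emission, not seconds; converting them to time requires the ratio between waveform samples and emission frames.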
We define a utility function that performs forced alignment with the model, tokenizer and aligner above.
def compute_alignments(waveform: torch.Tensor, transcript: List[str]):
    with torch.inference_mode():
        emission, _ = model(waveform.to(device))
        token_spans = aligner(emission[0], tokenizer(transcript))
    return emission, token_spans
We also define utility functions for plotting the result and previewing the audio segments.
# Compute average score weighted by the span length
def _score(spans):
    return sum(s.score * len(s) for s in spans) / sum(len(s) for s in spans)


def plot_alignments(waveform, token_spans, emission, transcript, sample_rate=bundle.sample_rate):
    ratio = waveform.size(1) / emission.size(1) / sample_rate

    fig, axes = plt.subplots(2, 1)
    axes[0].imshow(emission[0].detach().cpu().T, aspect="auto")
    axes[0].set_title("Emission")
    axes[0].set_xticks([])

    axes[1].specgram(waveform[0], Fs=sample_rate)
    for t_spans, chars in zip(token_spans, transcript):
        t0, t1 = t_spans[0].start, t_spans[-1].end
        axes[0].axvspan(t0 - 0.5, t1 - 0.5, facecolor="None", hatch="/", edgecolor="white")
        axes[1].axvspan(ratio * t0, ratio * t1, facecolor="None", hatch="/", edgecolor="white")
        axes[1].annotate(f"{_score(t_spans):.2f}", (ratio * t0, sample_rate * 0.51), annotation_clip=False)

        for span, char in zip(t_spans, chars):
            t0 = span.start * ratio
            axes[1].annotate(char, (t0, sample_rate * 0.55), annotation_clip=False)

    axes[1].set_xlabel("time [second]")
    fig.tight_layout()


def preview_word(waveform, spans, num_frames, transcript, sample_rate=bundle.sample_rate):
    ratio = waveform.size(1) / num_frames
    x0 = int(ratio * spans[0].start)
    x1 = int(ratio * spans[-1].end)
    print(f"{transcript} ({_score(spans):.2f}): {x0 / sample_rate:.3f} - {x1 / sample_rate:.3f} sec")
    segment = waveform[:, x0:x1]
    return IPython.display.Audio(segment.numpy(), rate=sample_rate)
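The arithmetic behind the score and the timestamps can be worked through without a model. The sketch below is only illustrative: the sample rate, the clip length, the number of emission frames and the character spans are assumed values, and `weighted_score` and `frames_to_seconds` are hypothetical stand-ins that mirror what `_score` and `preview_word` compute.

```python
from typing import List, Tuple

SAMPLE_RATE = 16000  # assumed sample rate in Hz
NUM_SAMPLES = 40000  # assumed clip length: 2.5 seconds at 16 kHz
NUM_FRAMES = 124     # assumed number of emission frames for that clip


def weighted_score(spans: List[Tuple[int, int, float]]) -> float:
    """Average the per-token scores, weighting each by its length in frames."""
    total_frames = sum(end - start for start, end, _ in spans)
    return sum(score * (end - start) for start, end, score in spans) / total_frames


def frames_to_seconds(start_frame: int, end_frame: int) -> Tuple[float, float]:
    """Convert a frame range to seconds by going through sample indices."""
    ratio = NUM_SAMPLES / NUM_FRAMES  # waveform samples per emission frame
    x0 = int(ratio * start_frame)
    x1 = int(ratio * end_frame)
    return x0 / SAMPLE_RATE, x1 / SAMPLE_RATE


# Hypothetical character spans for one word: (start_frame, end_frame, score).
word_spans = [(11, 14, 1.0), (14, 15, 0.5), (17, 23, 0.9)]
start_sec, end_sec = frames_to_seconds(word_spans[0][0], word_spans[-1][1])
print(f"{weighted_score(word_spans):.2f}: {start_sec:.3f} - {end_sec:.3f} sec")
```

Weighting by span length means a long, confidently aligned character contributes more to the word score than a single low-scoring frame.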
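Normalizing the transcript
Transcripts passed to the pipeline must be normalized beforehand. The exact normalization process depends on the language.
Languages without explicit word boundaries (such as Chinese, Japanese and Korean) require segmentation first. There are dedicated tools for this, but here we assume the transcript has already been segmented.
The first step of normalization is romanization. uroman is a tool that supports many languages.
Here is a BASH command that uses uroman to romanize an input text file and write the output to another text file.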
$ echo "Cette page concerne des événements d'actualité qui se sont produits durant l'année 1882" > text.txt
$ uroman/bin/uroman.pl < text.txt > text_romanized.txt
$ cat text_romanized.txt
Cette page concerne des evenements d'actualite qui se sont produits durant l'annee 1882
The next step is to remove non-alphabetic characters and punctuation. The following snippet normalizes the romanized transcript.
import re


def normalize_uroman(text):
    text = text.lower()
    text = text.replace("’", "'")
    text = re.sub("([^a-z' ])", " ", text)
    text = re.sub(' +', ' ', text)
    return text.strip()


with open("text_romanized.txt", "r") as f:
    for line in f:
        text_normalized = normalize_uroman(line)
        print(text_normalized)
Running the script on the example above produces the following.
cette page concerne des evenements d'actualite qui se sont produits durant l'annee
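Note that in this example, “1882” was not romanized by uroman, so it was removed in the normalization step. To avoid this, numbers need to be romanized, but this is known to be a non-trivial task.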
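Because such tokens disappear silently, it can help to list them before normalizing. The following is a minimal sketch under the assumption that the allowed character set is the same as in the normalization step (lowercase a-z and the apostrophe); `tokens_lost_in_normalization` is a hypothetical helper and not part of torchaudio or uroman.

```python
import re
from typing import List


def tokens_lost_in_normalization(text: str) -> List[str]:
    """Return whitespace-separated tokens that contain at least one character
    the normalization step would replace with a space."""
    text = text.lower().replace("’", "'")
    return [token for token in text.split() if re.search(r"[^a-z']", token)]


romanized = "Cette page concerne des evenements d'actualite qui se sont produits durant l'annee 1882"
print(tokens_lost_in_normalization(romanized))  # ['1882']
```

Tokens reported here can then be rewritten by hand or by a language-specific tool before the transcript is passed to the aligner.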
Aligning transcripts to speech
Now we perform forced alignment for multiple languages.
German
text_raw = "aber seit ich bei ihnen das brot hole"
text_normalized = "aber seit ich bei ihnen das brot hole"
url = "https://download.pytorch.org/torchaudio/tutorial-assets/10349_8674_000087.flac"
waveform, sample_rate = torchaudio.load(
    url, frame_offset=int(0.5 * bundle.sample_rate), num_frames=int(2.5 * bundle.sample_rate)
)
assert sample_rate == bundle.sample_rate
transcript = text_normalized.split()
tokens = tokenizer(transcript)
emission, token_spans = compute_alignments(waveform, transcript)
num_frames = emission.size(1)
plot_alignments(waveform, token_spans, emission, transcript)
print("Raw Transcript: ", text_raw)
print("Normalized Transcript: ", text_normalized)
IPython.display.Audio(waveform, rate=sample_rate)
[Figure: Emission]
Raw Transcript: aber seit ich bei ihnen das brot hole
Normalized Transcript: aber seit ich bei ihnen das brot hole
preview_word(waveform, token_spans[0], num_frames, transcript[0])
aber (0.96): 0.222 - 0.464 sec
preview_word(waveform, token_spans[1], num_frames, transcript[1])
seit (0.78): 0.565 - 0.766 sec
preview_word(waveform, token_spans[2], num_frames, transcript[2])
ich (0.91): 0.847 - 0.948 sec
preview_word(waveform, token_spans[3], num_frames, transcript[3])
bei (0.96): 1.028 - 1.190 sec
preview_word(waveform, token_spans[4], num_frames, transcript[4])
ihnen (0.65): 1.331 - 1.532 sec
preview_word(waveform, token_spans[5], num_frames, transcript[5])
das (0.54): 1.573 - 1.774 sec
preview_word(waveform, token_spans[6], num_frames, transcript[6])
brot (0.86): 1.855 - 2.117 sec
preview_word(waveform, token_spans[7], num_frames, transcript[7])
hole (0.71): 2.177 - 2.480 sec
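Chinese
Chinese is a character-based language, and its raw written form has no explicit word-level tokenization (separated by spaces). To obtain word-level alignments, you first need to tokenize the transcript at the word level using a word tokenizer such as the “Stanford Tokenizer”. However, this is not needed if you only want character-level alignments.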
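The difference between the two granularities comes down to how the transcript list is built. The sketch below is an illustration only: the character-to-romanization pairs in `SEGMENTED` are written by hand for this sentence (in practice they would come from a segmenter and a romanizer such as uroman), and `word_level` and `char_level` are hypothetical helpers.

```python
from typing import List, Tuple

# Hand-written (character, romanization) pairs, grouped by word.
# Illustrative data only, not the output of any tool.
SEGMENTED: List[List[Tuple[str, str]]] = [
    [("关", "guan")],
    [("服", "fu"), ("务", "wu")],
    [("高", "gao"), ("端", "duan")],
    [("产", "chan"), ("品", "pin")],
    [("仍", "reng")],
    [("处", "chu"), ("于", "yu")],
    [("供", "gong"), ("不", "bu"), ("应", "ying"), ("求", "qiu")],
    [("的", "de")],
    [("局", "ju"), ("面", "mian")],
]


def word_level(words: List[List[Tuple[str, str]]]) -> List[str]:
    """One transcript entry per segmented word."""
    return ["".join(roman for _, roman in word) for word in words]


def char_level(words: List[List[Tuple[str, str]]]) -> List[str]:
    """One transcript entry per written character; no word segmentation needed."""
    return [roman for word in words for _, roman in word]


print(word_level(SEGMENTED))
print(char_level(SEGMENTED))
```

Whichever list is passed as the transcript determines the grouping of the returned spans: one group of spans per list entry.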
text_raw = "关 服务 高端 产品 仍 处于 供不应求 的 局面"
text_normalized = "guan fuwu gaoduan chanpin reng chuyu gongbuyingqiu de jumian"
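# `waveform` and `sample_rate` must hold the Chinese recording at this point; the loading step is not shown here.
assert sample_rate == bundle.sample_rate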
transcript = text_normalized.split()
emission, token_spans = compute_alignments(waveform, transcript)
num_frames = emission.size(1)
plot_alignments(waveform, token_spans, emission, transcript)
print("Raw Transcript: ", text_raw)
print("Normalized Transcript: ", text_normalized)
IPython.display.Audio(waveform, rate=sample_rate)
[Figure: Emission]
Raw Transcript: 关 服务 高端 产品 仍 处于 供不应求 的 局面
Normalized Transcript: guan fuwu gaoduan chanpin reng chuyu gongbuyingqiu de jumian
preview_word(waveform, token_spans[0], num_frames, transcript[0])
guan (0.33): 0.020 - 0.141 sec
preview_word(waveform, token_spans[1], num_frames, transcript[1])
fuwu (0.31): 0.221 - 0.583 sec
preview_word(waveform, token_spans[2], num_frames, transcript[2])
gaoduan (0.74): 0.724 - 1.065 sec
preview_word(waveform, token_spans[3], num_frames, transcript[3])
chanpin (0.73): 1.126 - 1.528 sec
preview_word(waveform, token_spans[4], num_frames, transcript[4])
reng (0.86): 1.608 - 1.809 sec
preview_word(waveform, token_spans[5], num_frames, transcript[5])
chuyu (0.80): 1.849 - 2.151 sec
preview_word(waveform, token_spans[6], num_frames, transcript[6])
gongbuyingqiu (0.93): 2.251 - 2.894 sec
preview_word(waveform, token_spans[7], num_frames, transcript[7])
de (0.98): 2.935 - 3.015 sec
preview_word(waveform, token_spans[8], num_frames, transcript[8])
jumian (0.95): 3.075 - 3.477 sec
Polish
text_raw = "wtedy ujrzałem na jego brzuchu okrągłą czarną ranę"
text_normalized = "wtedy ujrzalem na jego brzuchu okragla czarna rane"
url = "https://download.pytorch.org/torchaudio/tutorial-assets/5090_1447_000088.flac"
waveform, sample_rate = torchaudio.load(url, num_frames=int(4.5 * bundle.sample_rate))
assert sample_rate == bundle.sample_rate
transcript = text_normalized.split()
emission, token_spans = compute_alignments(waveform, transcript)
num_frames = emission.size(1)
plot_alignments(waveform, token_spans, emission, transcript)
print("Raw Transcript: ", text_raw)
print("Normalized Transcript: ", text_normalized)
IPython.display.Audio(waveform, rate=sample_rate)
[Figure: Emission]
Raw Transcript: wtedy ujrzałem na jego brzuchu okrągłą czarną ranę
Normalized Transcript: wtedy ujrzalem na jego brzuchu okragla czarna rane
preview_word(waveform, token_spans[0], num_frames, transcript[0])
wtedy (1.00): 0.783 - 1.145 sec
preview_word(waveform, token_spans[1], num_frames, transcript[1])
ujrzalem (0.96): 1.286 - 1.788 sec
preview_word(waveform, token_spans[2], num_frames, transcript[2])
na (1.00): 1.868 - 1.949 sec
preview_word(waveform, token_spans[3], num_frames, transcript[3])
jego (1.00): 2.009 - 2.230 sec
preview_word(waveform, token_spans[4], num_frames, transcript[4])
brzuchu (0.97): 2.330 - 2.732 sec
preview_word(waveform, token_spans[5], num_frames, transcript[5])
okragla (1.00): 2.893 - 3.415 sec
preview_word(waveform, token_spans[6], num_frames, transcript[6])
czarna (0.90): 3.556 - 3.938 sec
preview_word(waveform, token_spans[7], num_frames, transcript[7])
rane (1.00): 4.098 - 4.399 sec
Portuguese
text_raw = "na imensa extensão onde se esconde o inconsciente imortal"
text_normalized = "na imensa extensao onde se esconde o inconsciente imortal"
url = "https://download.pytorch.org/torchaudio/tutorial-assets/6566_5323_000027.flac"
waveform, sample_rate = torchaudio.load(
    url, frame_offset=int(bundle.sample_rate), num_frames=int(4.6 * bundle.sample_rate)
)
assert sample_rate == bundle.sample_rate
transcript = text_normalized.split()
emission, token_spans = compute_alignments(waveform, transcript)
num_frames = emission.size(1)
plot_alignments(waveform, token_spans, emission, transcript)
print("Raw Transcript: ", text_raw)
print("Normalized Transcript: ", text_normalized)
IPython.display.Audio(waveform, rate=sample_rate)
[Figure: Emission]
Raw Transcript: na imensa extensão onde se esconde o inconsciente imortal
Normalized Transcript: na imensa extensao onde se esconde o inconsciente imortal
preview_word(waveform, token_spans[0], num_frames, transcript[0])
na (1.00): 0.020 - 0.080 sec
preview_word(waveform, token_spans[1], num_frames, transcript[1])
imensa (0.90): 0.120 - 0.502 sec
preview_word(waveform, token_spans[2], num_frames, transcript[2])
extensao (0.92): 0.542 - 1.205 sec
preview_word(waveform, token_spans[3], num_frames, transcript[3])
onde (1.00): 1.446 - 1.667 sec
preview_word(waveform, token_spans[4], num_frames, transcript[4])
se (0.99): 1.748 - 1.828 sec
preview_word(waveform, token_spans[5], num_frames, transcript[5])
esconde (0.99): 1.888 - 2.591 sec
preview_word(waveform, token_spans[6], num_frames, transcript[6])
o (0.98): 2.852 - 2.872 sec
preview_word(waveform, token_spans[7], num_frames, transcript[7])
inconsciente (0.80): 2.933 - 3.897 sec
preview_word(waveform, token_spans[8], num_frames, transcript[8])
imortal (0.86): 3.937 - 4.560 sec
Italian
text_raw = "elle giacean per terra tutte quante"
text_normalized = "elle giacean per terra tutte quante"
url = "https://download.pytorch.org/torchaudio/tutorial-assets/642_529_000025.flac"
waveform, sample_rate = torchaudio.load(url, num_frames=int(4 * bundle.sample_rate))
assert sample_rate == bundle.sample_rate
transcript = text_normalized.split()
emission, token_spans = compute_alignments(waveform, transcript)
num_frames = emission.size(1)
plot_alignments(waveform, token_spans, emission, transcript)
print("Raw Transcript: ", text_raw)
print("Normalized Transcript: ", text_normalized)
IPython.display.Audio(waveform, rate=sample_rate)
[Figure: Emission]
Raw Transcript: elle giacean per terra tutte quante
Normalized Transcript: elle giacean per terra tutte quante
preview_word(waveform, token_spans[0], num_frames, transcript[0])
elle (1.00): 0.563 - 0.864 sec
preview_word(waveform, token_spans[1], num_frames, transcript[1])
giacean (0.99): 0.945 - 1.467 sec
preview_word(waveform, token_spans[2], num_frames, transcript[2])
per (1.00): 1.588 - 1.789 sec
preview_word(waveform, token_spans[3], num_frames, transcript[3])
terra (1.00): 1.950 - 2.392 sec
preview_word(waveform, token_spans[4], num_frames, transcript[4])
tutte (1.00): 2.533 - 2.975 sec
preview_word(waveform, token_spans[5], num_frames, transcript[5])
quante (1.00): 3.055 - 3.678 sec
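Conclusion
In this tutorial, we looked at how to use torchaudio's forced alignment API and a Wav2Vec2 pre-trained multilingual acoustic model to align speech data to transcripts in five languages.
Acknowledgement
Thanks to Vineel Pratap and Zhaoheng Ni for developing and open-sourcing the forced aligner API.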