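Device ASR with Emformer RNN-T

Authors: Moto Hira, Jeff Hwang.

This tutorial shows how to use Emformer RNN-T and the streaming API to perform speech recognition on a streaming device input, such as the microphone on a laptop.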

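Note

This tutorial requires the FFmpeg libraries. Please refer to FFmpeg dependency for details.

Note

This tutorial was tested on a MacBook Pro and on a Dynabook running Windows 10.

This tutorial does not work on Google Colab, because the server that runs it has no microphone you can talk to.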

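1. Overview

We use the streaming API to fetch audio from an audio device (a microphone) chunk by chunk, then run inference on each chunk with Emformer RNN-T.

For the basic usage of the streaming API and Emformer RNN-T, please refer to StreamReader Basic Usage and Online ASR with Emformer RNN-T.

2. Checking the supported devices

First, we need to check which devices the streaming API can access, and work out the arguments (src and format) to pass to the StreamReader() class.

We use the ffmpeg command for this. ffmpeg abstracts away the differences between the underlying hardware implementations, but the expected value of format varies across operating systems, and each format defines its own syntax for src.

Details of the supported format values and the src syntax can be found at https://ffmpeg.dev.org.tw/ffmpeg-devices.html.

On macOS, the following command lists the available devices.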

$ ffmpeg -f avfoundation -list_devices true -i dummy
...
[AVFoundation indev @ 0x126e049d0] AVFoundation video devices:
[AVFoundation indev @ 0x126e049d0] [0] FaceTime HD Camera
[AVFoundation indev @ 0x126e049d0] [1] Capture screen 0
[AVFoundation indev @ 0x126e049d0] AVFoundation audio devices:
[AVFoundation indev @ 0x126e049d0] [0] ZoomAudioDevice
[AVFoundation indev @ 0x126e049d0] [1] MacBook Pro Microphone

We will pass the following values to the streaming API.

StreamReader(
    src = ":1",  # no video, audio from device 1, "MacBook Pro Microphone"
    format = "avfoundation",
)
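The device index can also be picked out of the listing programmatically. The following is a minimal sketch, not part of the original tutorial: `parse_avfoundation_audio_devices` and `audio_src` are hypothetical helper names, and the parser assumes the log layout shown above (ffmpeg writes this listing to stderr).

```python
import re

# Sample log copied from the listing above.
SAMPLE_LOG = """\
[AVFoundation indev @ 0x126e049d0] AVFoundation video devices:
[AVFoundation indev @ 0x126e049d0] [0] FaceTime HD Camera
[AVFoundation indev @ 0x126e049d0] [1] Capture screen 0
[AVFoundation indev @ 0x126e049d0] AVFoundation audio devices:
[AVFoundation indev @ 0x126e049d0] [0] ZoomAudioDevice
[AVFoundation indev @ 0x126e049d0] [1] MacBook Pro Microphone
"""

# Matches the trailing "[<index>] <name>" part of a device line.
_DEVICE_LINE = re.compile(r"\]\s+\[(\d+)\]\s+(.+?)\s*$")


def parse_avfoundation_audio_devices(log: str) -> dict:
    """Return {index: name} for the entries listed under 'audio devices'."""
    devices = {}
    in_audio_section = False
    for line in log.splitlines():
        if "AVFoundation audio devices:" in line:
            in_audio_section = True
        elif "AVFoundation video devices:" in line:
            in_audio_section = False
        elif in_audio_section:
            match = _DEVICE_LINE.search(line)
            if match:
                devices[int(match.group(1))] = match.group(2)
    return devices


def audio_src(index: int) -> str:
    """Build an avfoundation src that selects no video and one audio device."""
    return f":{index}"


audio_devices = parse_avfoundation_audio_devices(SAMPLE_LOG)
print(audio_devices)
print(audio_src(1))
```

For the sample log, the parser finds two audio devices, and index 1 yields the src value ":1" used above.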

On Windows, the dshow device should work.

> ffmpeg -f dshow -list_devices true -i dummy
...
[dshow @ 000001adcabb02c0] DirectShow video devices (some may be both video and audio devices)
[dshow @ 000001adcabb02c0]  "TOSHIBA Web Camera - FHD"
[dshow @ 000001adcabb02c0]     Alternative name "@device_pnp_\\?\usb#vid_10f1&pid_1a42&mi_00#7&27d916e6&0&0000#{65e8773d-8f56-11d0-a3b9-00a0c9223196}\global"
[dshow @ 000001adcabb02c0] DirectShow audio devices
[dshow @ 000001adcabb02c0]  "... (Realtek High Definition Audio)"
[dshow @ 000001adcabb02c0]     Alternative name "@device_cm_{33D9A762-90C8-11D0-BD43-00A0C911CE86}\wave_{BF2B8AE1-10B8-4CA4-A0DC-D02E18A56177}"

In the above case, the following values can be used to stream from the microphone.

StreamReader(
    src = r"audio=@device_cm_{33D9A762-90C8-11D0-BD43-00A0C911CE86}\wave_{BF2B8AE1-10B8-4CA4-A0DC-D02E18A56177}",  # raw string keeps the backslash literal
    format = "dshow",
)
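Since format and the src syntax both depend on the operating system, it can help to keep the choice in one place. The sketch below is an illustration only: `stream_reader_args` is a hypothetical helper, and it covers just the two platforms discussed in this tutorial.

```python
import platform


def stream_reader_args(device: str, system: str = "") -> dict:
    """Return the src/format pair for the given operating system.

    `device` is the audio device index on macOS (for example "1"), or the
    DirectShow device name or alternative name on Windows.
    """
    system = system or platform.system()
    if system == "Darwin":
        # ":<index>" means "no video, audio from device <index>".
        return {"src": f":{device}", "format": "avfoundation"}
    if system == "Windows":
        return {"src": f"audio={device}", "format": "dshow"}
    raise NotImplementedError(f"No device settings are covered here for {system!r}")


print(stream_reader_args("1", system="Darwin"))
print(stream_reader_args("Microphone (hypothetical name)", system="Windows"))
```

The returned dictionary can be unpacked into the constructor, for example `StreamReader(**stream_reader_args("1"))`.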

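3. Data acquisition

Streaming audio from a microphone requires the data acquisition to be timed properly. Failing to do so may introduce discontinuities in the data stream.

For this reason, we run the data acquisition in a subprocess.

First, we create a helper function that encapsulates everything executed in the subprocess.

This function initializes the streaming API, acquires data, and puts it in a queue that the main process is watching.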

import torch
import torchaudio


# The data acquisition process will stop after this number of steps.
# This eliminates the need of process synchronization and makes this
# tutorial simple.
NUM_ITER = 100


def stream(q, format, src, segment_length, sample_rate):
    from torchaudio.io import StreamReader

    print("Building StreamReader...")
    streamer = StreamReader(src, format=format)
    streamer.add_basic_audio_stream(frames_per_chunk=segment_length, sample_rate=sample_rate)

    print(streamer.get_src_stream_info(0))
    print(streamer.get_out_stream_info(0))

    print("Streaming...")
    print()
    stream_iterator = streamer.stream(timeout=-1, backoff=1.0)
    for _ in range(NUM_ITER):
        (chunk,) = next(stream_iterator)
        q.put(chunk)

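The notable difference from non-device streaming is that we pass the timeout and backoff parameters to the stream method.

When acquiring data, if acquisition requests arrive faster than the hardware can prepare the data, the underlying implementation reports a special error code and expects the client code to retry.

Precise timing is the key to smooth streaming. Reporting this error from the low-level implementation all the way up to the Python layer before retrying would add unnecessary overhead. For this reason, the retry behavior is implemented in the C++ layer, and the timeout and backoff parameters let the client code control it.

For details of the timeout and backoff parameters, please refer to the documentation of the stream() method.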
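To make the retry behavior concrete, here is a conceptual sketch in Python. It is not the actual C++ implementation. It assumes the semantics described in the stream() documentation: a negative timeout retries indefinitely, None disables retrying, and backoff is the wait between attempts in milliseconds. `fetch_with_retry` and `FakeDevice` are hypothetical names, and `BlockingIOError` stands in for the "resource temporarily unavailable" error code.

```python
import time


def fetch_with_retry(fetch, timeout_ms, backoff_ms):
    """Call `fetch` until it succeeds, sleeping `backoff_ms` between attempts.

    timeout_ms is None -> do not retry, let the error propagate
    timeout_ms < 0     -> retry forever
    timeout_ms >= 0    -> retry until that much time has passed
    """
    start = time.monotonic()
    while True:
        try:
            return fetch()
        except BlockingIOError:
            if timeout_ms is None:
                raise
            elapsed_ms = (time.monotonic() - start) * 1000
            if 0 <= timeout_ms <= elapsed_ms:
                raise TimeoutError("device did not produce data in time")
            time.sleep(backoff_ms / 1000)


class FakeDevice:
    """Hypothetical device whose buffer is not ready for the first few reads."""

    def __init__(self, not_ready_reads):
        self.remaining = not_ready_reads
        self.attempts = 0

    def read(self):
        self.attempts += 1
        if self.remaining > 0:
            self.remaining -= 1
            raise BlockingIOError("buffer not ready")
        return "chunk"


device = FakeDevice(not_ready_reads=3)
result = fetch_with_retry(device.read, timeout_ms=-1, backoff_ms=1.0)
print(result, device.attempts)
```

With timeout=-1 and backoff=1.0, as used in the stream helper above, the loop keeps polling about once per millisecond until the device has data.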

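Note

The proper value of backoff depends on the system configuration. One way to check whether a backoff value is appropriate is to save the sequence of acquired chunks as one continuous audio file and listen to it. If the backoff value is too large, the data stream is discontinuous and the resulting audio sounds sped up. If the backoff value is too small or zero, the audio stream is fine, but the data acquisition process enters a busy-waiting state, which increases CPU consumption.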
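The listening check described in this note could be done along the following lines. This is a minimal sketch that uses only the Python standard library; `save_chunks_as_wav` is a hypothetical helper, and it assumes mono chunks given as sequences of floats in the range [-1, 1].

```python
import array
import os
import sys
import tempfile
import wave


def save_chunks_as_wav(chunks, path, sample_rate):
    """Concatenate mono float chunks into a single 16-bit PCM WAV file."""
    samples = array.array("h")
    for chunk in chunks:
        for value in chunk:
            clipped = max(-1.0, min(1.0, float(value)))
            samples.append(int(clipped * 32767))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV stores samples in little-endian order
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        f.setframerate(sample_rate)
        f.writeframes(samples.tobytes())
    return len(samples)


# Three tiny dummy chunks stand in for the chunks taken from the queue.
chunks = [[0.0, 0.5, -0.5, 1.0]] * 3
path = os.path.join(tempfile.gettempdir(), "acquired_chunks.wav")
num_samples = save_chunks_as_wav(chunks, path, sample_rate=16000)
with wave.open(path, "rb") as f:
    print(f.getnframes(), f.getframerate())
```

With the chunks produced by the stream helper above, which are tensors of shape (frames, channels), each item could be passed as `chunk[:, 0].tolist()`.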

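4. Building the inference pipeline

The next step is to create the components required for inference.

This is the same process as in Online ASR with Emformer RNN-T.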

class Pipeline:
    """Build inference pipeline from RNNTBundle.

    Args:
        bundle (torchaudio.pipelines.RNNTBundle): Bundle object
        beam_width (int): Beam size of beam search decoder.
    """

    def __init__(self, bundle: torchaudio.pipelines.RNNTBundle, beam_width: int = 10):
        self.bundle = bundle
        self.feature_extractor = bundle.get_streaming_feature_extractor()
        self.decoder = bundle.get_decoder()
        self.token_processor = bundle.get_token_processor()

        self.beam_width = beam_width

        self.state = None
        self.hypotheses = None

    def infer(self, segment: torch.Tensor) -> str:
        """Perform streaming inference"""
        features, length = self.feature_extractor(segment)
        self.hypotheses, self.state = self.decoder.infer(
            features, length, self.beam_width, state=self.state, hypothesis=self.hypotheses
        )
        transcript = self.token_processor(self.hypotheses[0][0], lstrip=False)
        return transcript

class ContextCacher:
    """Cache the end of input data and prepend the next input data with it.

    Args:
        segment_length (int): The size of main segment.
            If the incoming segment is shorter, then the segment is padded.
        context_length (int): The size of the context, cached and prepended.
    """

    def __init__(self, segment_length: int, context_length: int):
        self.segment_length = segment_length
        self.context_length = context_length
        self.context = torch.zeros([context_length])

    def __call__(self, chunk: torch.Tensor):
        if chunk.size(0) < self.segment_length:
            chunk = torch.nn.functional.pad(chunk, (0, self.segment_length - chunk.size(0)))
        chunk_with_context = torch.cat((self.context, chunk))
        self.context = chunk[-self.context_length :]
        return chunk_with_context
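To see what the cacher does to consecutive chunks, the following sketch runs it on tiny tensors. The class is restated from above only so that the block runs on its own; the segment and context sizes (4 and 2) are arbitrary illustration values, not the ones the model uses.

```python
import torch


class ContextCacher:  # restated from above so that this block is self-contained
    def __init__(self, segment_length: int, context_length: int):
        self.segment_length = segment_length
        self.context_length = context_length
        self.context = torch.zeros([context_length])

    def __call__(self, chunk: torch.Tensor):
        if chunk.size(0) < self.segment_length:
            chunk = torch.nn.functional.pad(chunk, (0, self.segment_length - chunk.size(0)))
        chunk_with_context = torch.cat((self.context, chunk))
        self.context = chunk[-self.context_length :]
        return chunk_with_context


cacher = ContextCacher(segment_length=4, context_length=2)
first = cacher(torch.tensor([1.0, 2.0, 3.0, 4.0]))
second = cacher(torch.tensor([5.0, 6.0, 7.0, 8.0]))
third = cacher(torch.tensor([9.0, 10.0]))  # shorter than segment_length, so zero-padded

print(first.tolist())
print(second.tolist())
print(third.tolist())
```

The first output starts with two zeros because no context has been cached yet. Each later output starts with the last two samples of the previous chunk, and the short third chunk is padded with zeros up to the segment length.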

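5. The main process

The execution flow of the main process is as follows:

1. Initialize the inference pipeline.
2. Launch the data acquisition subprocess.
3. Run inference.
4. Clean up.

Note

Because the data acquisition subprocess is launched with the "spawn" method, all the code at global scope is executed in the subprocess as well.

We want to instantiate the pipeline only in the main process, so we put that code in a function and call it within the `__name__ == "__main__"` guard.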

def main(device, src, bundle):
    print(torch.__version__)
    print(torchaudio.__version__)

    print("Building pipeline...")
    pipeline = Pipeline(bundle)

    sample_rate = bundle.sample_rate
    segment_length = bundle.segment_length * bundle.hop_length
    context_length = bundle.right_context_length * bundle.hop_length

    print(f"Sample rate: {sample_rate}")
    print(f"Main segment: {segment_length} frames ({segment_length / sample_rate} seconds)")
    print(f"Right context: {context_length} frames ({context_length / sample_rate} seconds)")

    cacher = ContextCacher(segment_length, context_length)

    @torch.inference_mode()
    def infer():
        for _ in range(NUM_ITER):
            chunk = q.get()
            segment = cacher(chunk[:, 0])
            transcript = pipeline.infer(segment)
            print(transcript, end="\r", flush=True)

    import torch.multiprocessing as mp

    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    p = ctx.Process(target=stream, args=(q, device, src, segment_length, sample_rate))
    p.start()
    infer()
    p.join()


if __name__ == "__main__":
    main(
        device="avfoundation",
        src=":1",
        bundle=torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH,
    )

Building pipeline...
Sample rate: 16000
Main segment: 2560 frames (0.16 seconds)
Right context: 640 frames (0.04 seconds)
Building StreamReader...
SourceAudioStream(media_type='audio', codec='pcm_f32le', codec_long_name='PCM 32-bit floating point little-endian', format='flt', bit_rate=1536000, sample_rate=48000.0, num_channels=1)
OutputStream(source_index=0, filter_description='aresample=16000,aformat=sample_fmts=fltp')
Streaming...

hello world
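The sizes printed above follow from the arithmetic in main. The sketch below reproduces it with plain numbers. The hop length and frame counts are assumptions: the tutorial prints only the products, and a hop of 160 samples with a 16-frame segment and a 4-frame right context is one combination consistent with that output.

```python
# Assumed values (see the note above); only sample_rate and NUM_ITER
# appear literally in the tutorial.
sample_rate = 16000
hop_length = 160
segment_frames = 16
right_context_frames = 4
NUM_ITER = 100

segment_length = segment_frames * hop_length        # samples per main segment
context_length = right_context_frames * hop_length  # samples of cached context
model_input_length = segment_length + context_length
total_seconds = NUM_ITER * segment_length / sample_rate

print(f"Main segment: {segment_length} samples ({segment_length / sample_rate} seconds)")
print(f"Right context: {context_length} samples ({context_length / sample_rate} seconds)")
print(f"Samples per inference call: {model_input_length}")
print(f"Audio covered by NUM_ITER chunks: {total_seconds} seconds")
```

Under these assumptions each inference call receives 3200 samples (0.2 seconds), and the acquisition loop stops after 16 seconds of audio.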
