StreamWriter Basic Usage¶

Author: Moto Hira

This tutorial shows how to use torchaudio.io.StreamWriter to encode audio and video data and save it in various formats and to various destinations.

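Note

This tutorial requires the FFmpeg libraries. See FFmpeg dependency for details.

Warning

TorchAudio dynamically loads compatible FFmpeg libraries installed on the system. The supported formats (media formats, encoders, encoder options, etc.) depend on those libraries.

To check the available muxers and encoders, you can use the following commands: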

ffmpeg -muxers
ffmpeg -encoders
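
The same check can be scripted. The following is a minimal sketch, assuming the ffmpeg command-line tool is available on PATH. Note that TorchAudio loads the FFmpeg libraries rather than the CLI, so the CLI may belong to a different build than the one TorchAudio uses. The helper name list_ffmpeg is hypothetical.

```python
import shutil
import subprocess


def list_ffmpeg(kind):
    """Return the lines printed by `ffmpeg -muxers` or `ffmpeg -encoders`.

    Returns an empty list when the ffmpeg executable cannot be found
    or the command fails.
    """
    if kind not in ("muxers", "encoders"):
        raise ValueError(f"unsupported kind: {kind!r}")
    exe = shutil.which("ffmpeg")
    if exe is None:
        return []
    result = subprocess.run(
        [exe, "-hide_banner", f"-{kind}"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return result.stdout.splitlines()


encoders = list_ffmpeg("encoders")
print(f"{len(encoders)} lines reported by `ffmpeg -encoders`")
```

An empty result only means the CLI was not found or failed; it says nothing about whether TorchAudio can load the libraries.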

Preparation¶

import torch
import torchaudio

print(torch.__version__)
print(torchaudio.__version__)

from torchaudio.io import StreamWriter

print("FFmpeg library versions")
for k, v in torchaudio.utils.ffmpeg_utils.get_versions().items():
    print(f"  {k}: {v}")

2.6.0
2.6.0
FFmpeg library versions
  libavcodec: (60, 3, 100)
  libavdevice: (60, 1, 100)
  libavfilter: (9, 3, 100)
  libavformat: (60, 3, 100)
  libavutil: (58, 2, 100)

import io
import os
import tempfile

from IPython.display import Audio, Video

from torchaudio.utils import download_asset

SAMPLE_PATH = download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav")
WAVEFORM, SAMPLE_RATE = torchaudio.load(SAMPLE_PATH, channels_first=False)
NUM_FRAMES, NUM_CHANNELS = WAVEFORM.shape

_BASE_DIR = tempfile.TemporaryDirectory()


def get_path(filename):
    return os.path.join(_BASE_DIR.name, filename)

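Basic usage¶

To save Tensor data in a media format with StreamWriter, three steps are necessary:

1. Specify the output
2. Configure streams
3. Write data

The following code illustrates how to save audio data as a WAV file.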

# 1. Define the destination. (local file in this case)
path = get_path("test.wav")
s = StreamWriter(path)

# 2. Configure the stream. (sample rate and channel count come from the loaded waveform)
s.add_audio_stream(
    sample_rate=SAMPLE_RATE,
    num_channels=NUM_CHANNELS,
)

# 3. Write the data
with s.open():
    s.write_audio_chunk(0, WAVEFORM)

Audio(path)

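Now we look into each step in more detail.

Write destination¶

StreamWriter supports different types of write destinations:

Local files
File-like objects
Streaming protocols (such as RTMP and UDP)
Media devices (speakers and video players) †

† For media devices, please refer to StreamWriter Advanced Usages.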

Local files¶

StreamWriter supports saving media to local files.

StreamWriter(dst="audio.wav")

StreamWriter(dst="audio.mp3")

This works for still images and videos as well.

StreamWriter(dst="image.jpeg")

StreamWriter(dst="video.mpeg")

File-like objects¶

You can also pass a file-like object. A file-like object must implement a write method conforming to io.RawIOBase.write.

# Open the local file as fileobj
with open("audio.wav", "wb") as dst:
    StreamWriter(dst=dst)

# In-memory encoding
buffer = io.BytesIO()
StreamWriter(dst=buffer)
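
To make the write contract concrete, here is a minimal sketch of a custom file-like object: write accepts a bytes-like object and returns the number of bytes written. The class name CountingSink is hypothetical, and the sketch has not been exercised against StreamWriter here. Some container formats may additionally need a seekable destination, which this sketch does not cover.

```python
import io


class CountingSink(io.RawIOBase):
    """Hypothetical write-only sink that forwards bytes and counts them."""

    def __init__(self, inner):
        super().__init__()
        self._inner = inner
        self.num_bytes = 0

    def writable(self):
        return True

    def write(self, data):
        # io.RawIOBase.write takes a bytes-like object and
        # returns the number of bytes actually written.
        written = self._inner.write(data)
        self.num_bytes += written
        return written


sink = CountingSink(io.BytesIO())
sink.write(b"RIFF")
sink.write(b"\x00" * 12)
print(sink.num_bytes)  # 16
```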

Streaming protocols¶

You can stream the media with streaming protocols.

# Real-Time Messaging Protocol
StreamWriter(dst="rtmp://127.0.0.1:1234/live/app", format="flv")

# UDP
StreamWriter(dst="udp://127.0.0.1:48550", format="mpegts")

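Configuring output streams¶

Once the destination is specified, the next step is to configure the streams. For typical audio and still-image cases only one stream is required, but for video with audio at least two streams (one for audio and one for video) need to be configured.

Audio stream¶

An audio stream can be added with the add_audio_stream() method.

For writing regular audio files, at minimum sample_rate and num_channels are required.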

s = StreamWriter("audio.wav")
s.add_audio_stream(sample_rate=8000, num_channels=2)

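By default, audio streams expect the input waveform tensors to be of torch.float32 type. In that case, the data is encoded with the default encoding format of the container, which for WAV is 16-bit signed integer linear PCM. StreamWriter converts the sample format internally.

If the encoder supports multiple sample formats and you want to change the encoder sample format, you can use the encoder_format option.

In the following example, StreamWriter expects the input waveform tensor to be of torch.float32 type, but it converts the samples to 16-bit signed integers when encoding.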

s = StreamWriter("audio.mp3")
s.add_audio_stream(
    ...,
    encoder="libmp3lame",   # "libmp3lame" is often the default encoder for mp3,
                            # but specifying it manually, for the sake of illustration.

    encoder_format="s16p",  # "libmp3lame" encoder supports the following sample format.
                            #  - "s16p" (16-bit signed integer)
                            #  - "s32p" (32-bit signed integer)
                            #  - "fltp" (32-bit floating point)
)

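If the data type of your waveform tensor is something other than torch.float32, you can provide the format option to change the expected data type.

The following example configures StreamWriter to expect tensors of torch.int16 type.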

# Audio data passed to StreamWriter must be torch.int16
s.add_audio_stream(..., format="s16")

The following figure illustrates how the format and encoder_format options work for audio streams.

https://download.pytorch.org/torchaudio/tutorial-assets/streamwriter-format-audio.png
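
For reference, the sketch below lists the audio format strings and the tensor dtype each one implies. The mapping is transcribed from the add_audio_stream API documentation and should be verified against your installed version. Dtypes are written as strings so the snippet runs without torch, and the helper expected_dtype is hypothetical.

```python
# Input sample format -> dtype expected by write_audio_chunk.
# Assumption: taken from the add_audio_stream documentation.
AUDIO_FORMAT_TO_DTYPE = {
    "u8": "torch.uint8",
    "s16": "torch.int16",
    "s32": "torch.int32",
    "s64": "torch.int64",
    "flt": "torch.float32",  # the default
    "dbl": "torch.float64",
}


def expected_dtype(fmt="flt"):
    """Return the dtype name a stream configured with `fmt` expects."""
    try:
        return AUDIO_FORMAT_TO_DTYPE[fmt]
    except KeyError:
        raise ValueError(f"unknown audio format: {fmt!r}") from None


print(expected_dtype())       # torch.float32
print(expected_dtype("s16"))  # torch.int16
```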

Video stream¶

To add a still image or a video stream, you can use the add_video_stream() method.

At minimum, frame_rate, height and width are required.

s = StreamWriter("video.mp4")
s.add_video_stream(frame_rate=10, height=96, width=128)

For still images, please use frame_rate=1.

s = StreamWriter("image.png")
s.add_video_stream(frame_rate=1, ...)

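Similar to the audio stream, you can provide the format and encoder_format options to control the format of the input data and of the encoding.

The following example encodes video data in YUV422 format.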

s = StreamWriter("video.mov")
s.add_video_stream(
    ...,
    encoder="libx264",  # libx264 supports different YUV formats, such as
                        # yuv420p yuvj420p yuv422p yuvj422p yuv444p yuvj444p nv12 nv16 nv21

    encoder_format="yuv422p",  # StreamWriter will convert the input data to YUV422 internally
)

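YUV formats are commonly used in video encoding. In many YUV formats, the chroma planes have a different size from the luma plane, which makes it difficult to express such data directly as a single torch.Tensor. Therefore, StreamWriter automatically converts the input video tensor into the target format.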
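
The size mismatch between planes can be made concrete with a little arithmetic. The sketch below computes the per-plane shapes of three common planar YUV formats for a 128x96 frame. The helper plane_shapes is hypothetical and only illustrates chroma subsampling; it is not how StreamWriter performs the conversion.

```python
# (horizontal, vertical) chroma subsampling factors
CHROMA_SUBSAMPLING = {
    "yuv420p": (2, 2),  # chroma halved in both directions
    "yuv422p": (2, 1),  # chroma halved horizontally only
    "yuv444p": (1, 1),  # chroma at full resolution
}


def plane_shapes(pix_fmt, height, width):
    """Return the (height, width) of the Y, U and V planes."""
    sx, sy = CHROMA_SUBSAMPLING[pix_fmt]
    # Ceiling division so that odd dimensions are still covered
    chroma = (-(-height // sy), -(-width // sx))
    return {"Y": (height, width), "U": chroma, "V": chroma}


for fmt in CHROMA_SUBSAMPLING:
    print(fmt, plane_shapes(fmt, height=96, width=128))
```

Only yuv444p has three planes of equal size, so only that layout would fit a regular (channel, height, width) tensor without padding or resampling.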

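StreamWriter expects the input image tensor to be 4-D (time, channel, height, width) and of torch.uint8 type.

The default color channel layout is RGB, that is, three color channels corresponding to red, green and blue. If your input has a different color layout, such as BGR or YUV, you can specify it with the format option.

The following example specifies BGR format.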

s.add_video_stream(..., format="bgr24")
                   # Image data passed to StreamWriter must have
                   # three color channels representing Blue Green Red.
                   #
                   # The shape of the input tensor has to be
                   # (time, channel==3, height, width)

The following figure illustrates how the format and encoder_format options work for video streams.

https://download.pytorch.org/torchaudio/tutorial-assets/streamwriter-format-video.png

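Write data¶

Once streams are configured, the next step is to open the output location and start writing data.

Use the open() method to open the destination, and then write data with write_audio_chunk() and/or write_video_chunk().

Audio tensors are expected to have the shape (time, channels), and video/image tensors are expected to have the shape (time, channels, height, width).

Channels, height and width must match the configuration of the corresponding stream; for video, the number of channels is determined by the "format" option.

A tensor representing a still image must have exactly one frame in the time dimension, but audio and video tensors can have an arbitrary number of frames in the time dimension.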
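
The shape rules above can be written down as a rough pre-flight check. This is a minimal sketch with hypothetical helper names (check_audio_shape, check_video_shape); it works on plain tuples, and a tensor's .shape can be passed in the same way.

```python
def check_audio_shape(shape, num_channels):
    """Audio chunks are 2-D: (time, channels)."""
    if len(shape) != 2:
        raise ValueError(f"expected (time, channels), got {len(shape)}-D")
    if shape[1] != num_channels:
        raise ValueError(f"expected {num_channels} channels, got {shape[1]}")


def check_video_shape(shape, num_channels, height, width, still_image=False):
    """Video chunks are 4-D: (time, channels, height, width)."""
    if len(shape) != 4:
        raise ValueError(f"expected 4-D input, got {len(shape)}-D")
    if tuple(shape[1:]) != (num_channels, height, width):
        raise ValueError(f"unexpected frame shape: {tuple(shape[1:])}")
    # A still image is a video with exactly one frame
    if still_image and shape[0] != 1:
        raise ValueError(f"still image needs 1 frame, got {shape[0]}")


check_audio_shape((16000, 2), num_channels=2)
check_video_shape((90, 3, 96, 128), num_channels=3, height=96, width=128)
check_video_shape((1, 3, 96, 128), 3, 96, 128, still_image=True)
print("all shapes accepted")
```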

The following code snippets illustrate this.

Example) Audio¶

# Configure stream
s = StreamWriter(dst=get_path("audio.wav"))
s.add_audio_stream(sample_rate=SAMPLE_RATE, num_channels=NUM_CHANNELS)

# Write data
with s.open():
    s.write_audio_chunk(0, WAVEFORM)

Example) Image¶

# Image config
height = 96
width = 128

# Configure stream
s = StreamWriter(dst=get_path("image.png"))
s.add_video_stream(frame_rate=1, height=height, width=width, format="rgb24")

# Generate image
chunk = torch.randint(256, (1, 3, height, width), dtype=torch.uint8)

# Write data
with s.open():
    s.write_video_chunk(0, chunk)

Example) Video without audio¶

# Video config
frame_rate = 30
height = 96
width = 128

# Configure stream
s = StreamWriter(dst=get_path("video.mp4"))
s.add_video_stream(frame_rate=frame_rate, height=height, width=width, format="rgb24")

# Generate video chunk (3 seconds)
time = int(frame_rate * 3)
chunk = torch.randint(256, (time, 3, height, width), dtype=torch.uint8)

# Write data
with s.open():
    s.write_video_chunk(0, chunk)

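Example) Video with audio¶

To write video with audio, separate streams have to be configured. The first argument of the write methods is the stream index, which follows the order in which the streams were added (here, audio is 0 and video is 1).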

# Configure stream
s = StreamWriter(dst=get_path("video.mp4"))
s.add_audio_stream(sample_rate=SAMPLE_RATE, num_channels=NUM_CHANNELS)
s.add_video_stream(frame_rate=frame_rate, height=height, width=width, format="rgb24")

# Generate audio/video chunk (3 seconds)
time = int(SAMPLE_RATE * 3)
audio_chunk = torch.randn((time, NUM_CHANNELS))
time = int(frame_rate * 3)
video_chunk = torch.randint(256, (time, 3, height, width), dtype=torch.uint8)

# Write data
with s.open():
    s.write_audio_chunk(0, audio_chunk)
    s.write_video_chunk(1, video_chunk)

Writing data chunk by chunk¶

When writing data, it is possible to split the data along the time dimension and write it in smaller chunks.

# Write data in one-go
dst1 = io.BytesIO()
s = StreamWriter(dst=dst1, format="mp3")
s.add_audio_stream(SAMPLE_RATE, NUM_CHANNELS)
with s.open():
    s.write_audio_chunk(0, WAVEFORM)

# Write data in smaller chunks
dst2 = io.BytesIO()
s = StreamWriter(dst=dst2, format="mp3")
s.add_audio_stream(SAMPLE_RATE, NUM_CHANNELS)
with s.open():
    for start in range(0, NUM_FRAMES, SAMPLE_RATE):
        end = start + SAMPLE_RATE
        s.write_audio_chunk(0, WAVEFORM[start:end, ...])

# Check that the contents are same
dst1.seek(0)
bytes1 = dst1.read()

print(f"bytes1: {len(bytes1)}")
print(f"{bytes1[:10]}...{bytes1[-10:]}\n")

dst2.seek(0)
bytes2 = dst2.read()

print(f"bytes2: {len(bytes2)}")
print(f"{bytes2[:10]}...{bytes2[-10:]}\n")

assert bytes1 == bytes2

import matplotlib.pyplot as plt

bytes1: 10700
b'ID3\x04\x00\x00\x00\x00\x00"'...b'\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa'

bytes2: 10700
b'ID3\x04\x00\x00\x00\x00\x00"'...b'\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa'
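
The chunked loop above relies on slicing to clamp the end index of the final, shorter chunk. The sketch below makes those boundaries explicit. The helper chunk_bounds and the sample counts are hypothetical.

```python
def chunk_bounds(num_frames, chunk_size):
    """Yield (start, end) index pairs that cover range(num_frames)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, num_frames, chunk_size):
        # The final chunk is clamped to the end of the data
        yield start, min(start + chunk_size, num_frames)


# Hypothetical numbers: 3.4 seconds of 16 kHz audio, one-second chunks
bounds = list(chunk_bounds(num_frames=54400, chunk_size=16000))
print(bounds)
# [(0, 16000), (16000, 32000), (32000, 48000), (48000, 54400)]
```

Each pair could then be used as WAVEFORM[start:end] in the write loop.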

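Example - Spectrum Visualizer¶

In this section, we use StreamWriter to create a spectrum visualization of audio and save it as a video file.

To create the spectrum visualization, we use torchaudio.transforms.Spectrogram to obtain a spectral representation of the audio, generate raster images of its visualization with matplotlib, and then use StreamWriter to convert them into a video with the original audio.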

import torchaudio.transforms as T

Prepare data¶

First, we prepare the spectrogram data. We use Spectrogram.

We adjust hop_length so that one frame of the spectrogram corresponds to one video frame.

frame_rate = 20
n_fft = 4000

trans = T.Spectrogram(
    n_fft=n_fft,
    hop_length=SAMPLE_RATE // frame_rate,  # One FFT per one video frame
    normalized=True,
    power=1,
)
specs = trans(WAVEFORM.T)[0].T
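
The hop_length arithmetic can be checked by hand. This is a small worked example with hypothetical numbers (16 kHz audio, three seconds long), assuming the spectrogram uses centered frames, in which case the frame count is 1 + num_samples // hop_length.

```python
sample_rate = 16000  # hypothetical; the tutorial uses SAMPLE_RATE
frame_rate = 20      # video frames per second
num_samples = 3 * sample_rate

# One spectrogram frame per video frame
hop_length = sample_rate // frame_rate

# Assumption: centered frames, so one extra frame at the boundary
num_spec_frames = 1 + num_samples // hop_length

# Effective spectrogram frame rate; it equals frame_rate only when
# sample_rate is divisible by frame_rate
effective_rate = sample_rate / hop_length

print(hop_length)       # 800
print(num_spec_frames)  # 61
print(effective_rate)   # 20.0
```

If sample_rate is not a multiple of frame_rate, the integer division rounds hop_length down and the spectrogram frames slowly drift ahead of the video frames.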

The resulting spectrogram looks like the following.

spec_db = T.AmplitudeToDB(stype="magnitude", top_db=80)(specs.T)
_ = plt.imshow(spec_db, aspect="auto", origin="lower")

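Prepare canvas¶

We use matplotlib to visualize the spectrogram frame by frame. We create a helper function that plots the spectrogram data and generates a raster image of the figure.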

fig, ax = plt.subplots(figsize=[3.2, 2.4])
ax.set_position([0, 0, 1, 1])
ax.set_facecolor("black")
ncols, nrows = fig.canvas.get_width_height()


def _plot(data):
    ax.clear()
    x = list(range(len(data)))
    R, G, B = 238 / 255, 76 / 255, 44 / 255
    for coeff, alpha in [(0.8, 0.7), (1, 1)]:
        d = data**coeff
        ax.fill_between(x, d, -d, color=[R, G, B, alpha])
    xlim = n_fft // 2 + 1
    ax.set_xlim([-1, xlim])
    ax.set_ylim([-1, 1])
    ax.text(
        xlim,
        0.95,
        f"Created with TorchAudio\n{torchaudio.__version__}",
        color="white",
        ha="right",
        va="top",
        backgroundcolor="black",
    )
    fig.canvas.draw()
    frame = torch.frombuffer(fig.canvas.tostring_rgb(), dtype=torch.uint8)
    return frame.reshape(nrows, ncols, 3).permute(2, 0, 1)


# sphinx_gallery_defer_figures

Write video¶

Finally, we use StreamWriter and write the video. We process one second of audio and video frames at a time.

s = StreamWriter(get_path("example.mp4"))
s.add_audio_stream(sample_rate=SAMPLE_RATE, num_channels=NUM_CHANNELS)
s.add_video_stream(frame_rate=frame_rate, height=nrows, width=ncols)

with s.open():
    i = 0
    # Process by second
    for t in range(0, NUM_FRAMES, SAMPLE_RATE):
        # Write audio chunk
        s.write_audio_chunk(0, WAVEFORM[t : t + SAMPLE_RATE, :])

        # write 1 second of video chunk
        frames = [_plot(spec) for spec in specs[i : i + frame_rate]]
        if frames:
            s.write_video_chunk(1, torch.stack(frames))
        i += frame_rate

plt.close(fig)

/pytorch/audio/examples/tutorials/streamwriter_basic_tutorial.py:566: MatplotlibDeprecationWarning: The tostring_rgb function was deprecated in Matplotlib 3.8 and will be removed two minor releases later. Use buffer_rgba instead.
  frame = torch.frombuffer(fig.canvas.tostring_rgb(), dtype=torch.uint8)
/pytorch/audio/examples/tutorials/streamwriter_basic_tutorial.py:566: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /opt/conda/conda-bld/pytorch_1727971112454/work/torch/csrc/utils/tensor_new.cpp:1560.)
  frame = torch.frombuffer(fig.canvas.tostring_rgb(), dtype=torch.uint8)

Result¶

The result looks like the following.

Video(get_path("example.mp4"), embed=True)

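Watching the video carefully, it can be observed that the "s" sounds (curiosity, besides, this) have more energy allocated to the higher-frequency side (the right side of the video).

Tags: torchaudio.io

Total running time of the script: ( 0 minutes 7.371 seconds)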