
References

[Yes]

Yesno. URL: http://www.openslr.org/1/.

[AB79]

Jont B Allen and David A Berkley. Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America, 65(4):943–950, 1979.

[ABD+20]

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. Common voice: a massively-multilingual speech corpus. 2020. arXiv:1912.06670.

[BWT+21]

Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, and others. Xls-r: self-supervised cross-lingual speech representation learning at scale. arXiv preprint arXiv:2111.09296, 2021.

[BZMA20]

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. Wav2vec 2.0: a framework for self-supervised learning of speech representations. 2020. arXiv:2006.11477.

[BBL+08]

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower Provost, Samuel Kim, Jeannette Chang, Sungbok Lee, and Shrikanth Narayanan. Iemocap: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42:335–359, 12 2008. doi:10.1007/s10579-008-9076-6.

[Cap69]

Jack Capon. High-resolution frequency-wavenumber spectrum analysis. Proceedings of the IEEE, 57(8):1408–1418, 1969.

[CDiGangiB+21]

Roldano Cattoni, Mattia Antonino Di Gangi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. Must-c: a multilingual corpus for end-to-end speech translation. Computer Speech & Language, 66:101155, 2021. URL: https://www.sciencedirect.com/science/article/pii/S0885230820300887, doi:10.1016/j.csl.2020.101155.

[CCW+21]

Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Yujun Wang, Zhao You, and Zhiyong Yan. Gigaspeech: an evolving, multi-domain asr corpus with 10,000 hours of transcribed audio. In Proc. Interspeech 2021. 2021.

[CWC+22]

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, and others. Wavlm: large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022.

[CPS16]

Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve. Wav2letter: an end-to-end convnet-based speech recognition system. 2016. arXiv:1609.03193.

[CBC+20]

Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. Unsupervised cross-lingual representation learning for speech recognition. 2020. arXiv:2006.13979.

[CY21]

Erica Cooper and Junichi Yamagishi. How do voices from past speech synthesis challenges compare today? arXiv preprint arXiv:2105.02373, 2021.

[CPC+20]

Joris Cosentino, Manuel Pariente, Samuele Cornell, Antoine Deleforge, and Emmanuel Vincent. Librimix: an open-source dataset for generalizable speech separation. 2020. arXiv:2005.11262.

[CSB+18]

Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, and others. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190, 2018.

[DL82]

DC Dowson and BV Landau. The fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12(3):450–455, 1982.

[Defossez21]

Alexandre Défossez. Hybrid spectrogram and waveform source separation. In Proceedings of the ISMIR 2021 Workshop on Music Source Separation. 2021.

[FP21]

Marco Forgione and Dario Piga. Dynonet: a neural network architecture for learning dynamical systems. International Journal of Adaptive Control and Signal Processing, 35(4):612–626, 2021.

[GKRR14]

Mark John Francis Gales, Kate Knill, Anton Ragni, and Shakti Prasad Rath. Speech recognition and keyword spotting for low-resource languages: babel project research at cued. In SLTU. 2014.

[Gra12]

Alex Graves. Sequence transduction with recurrent neural networks. 2012. arXiv:1211.3711.

[GL83]

D. Griffin and Jae Lim. Signal estimation from modified short-time fourier transform. In ICASSP '83. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 8, 804–807. 1983. doi:10.1109/ICASSP.1983.1172092.

[GQC+20]

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. Conformer: convolution-augmented transformer for speech recognition. 2020. arXiv:2005.08100.

[HCC+14]

Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep speech: scaling up end-to-end speech recognition. 2014. arXiv:1412.5567.

[HCE+17]

Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron Weiss, and Kevin Wilson. Cnn architectures for large-scale audio classification. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2017. URL: https://arxiv.org/abs/1609.09430.

[HIA+17]

Takuya Higuchi, Nobutaka Ito, Shoko Araki, Takuya Yoshioka, Marc Delcroix, and Tomohiro Nakatani. Online mvdr beamformer based on complex gaussian mixture model with spatial prior for noise robust asr. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(4):780–793, 2017.

[HIYN16]

Takuya Higuchi, Nobutaka Ito, Takuya Yoshioka, and Tomohiro Nakatani. Robust mvdr beamforming using time-frequency masks for online/offline asr in noise. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5210–5214. IEEE, 2016.

[HBT+21]

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: self-supervised speech representation learning by masked prediction of hidden units. 2021. arXiv:2106.07447.

[IJ17]

Keith Ito and Linda Johnson. The lj speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.

[KPL+22]

Jacob Kahn, Vineel Pratap, Tatiana Likhomanenko, Qiantong Xu, Awni Hannun, Jeff Cai, Paden Tomasello, Ann Lee, Edouard Grave, Gilad Avidov, and others. Flashlight: enabling innovation in tools for machine learning. arXiv preprint arXiv:2201.12465, 2022.

[KES+18a]

Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. 2018. arXiv:1802.08435.

[KES+18b]

Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. CoRR, 2018. URL: http://arxiv.org/abs/1802.08435, arXiv:1802.08435.

[KPPK15]

Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. Audio augmentation for speech recognition. In Proc. Interspeech 2015, 3586–3589. 2015. doi:10.21437/Interspeech.2015-711.

[KBV03]

John Kominek and Alan W Black. Cmu arctic databases for speech synthesis. Technical Report, 2003.

[KKB20]

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: generative adversarial networks for efficient and high fidelity speech synthesis. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, 17022–17033. Curran Associates, Inc., 2020. URL: https://proceedings.neurips.cc/paper/2020/file/c5d736809766d46260d816d8dbc9eb44-Paper.pdf.

[KTN+23]

Anurag Kumar, Ke Tan, Zhaoheng Ni, Pranay Manocha, Xiaohui Zhang, Ethan Henderson, and Buye Xu. Torchaudio-squim: reference-less speech quality and intelligibility measures in torchaudio. arXiv preprint arXiv:2304.01448, 2023.

[LRI+19]

Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Singh Tomar, and Yoshua Bengio. Speech model pre-training for end-to-end spoken language understanding. In Gernot Kubin and Zdravko Kacic, editors, Proc. of Interspeech, 814–818. 2019.

[LM19]

Yi Luo and Nima Mesgarani. Conv-tasnet: surpassing ideal time-frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256–1266, Aug 2019. URL: http://dx.doi.org/10.1109/TASLP.2019.2915167, doi:10.1109/taslp.2019.2915167.

[MK22]

Pranay Manocha and Anurag Kumar. Speech quality assessment through MOS using non-matching references. arXiv preprint arXiv:2206.12285, 2022.

[MRFB+15]

Xavier Anguera Miro, Luis Javier Rodriguez-Fuentes, Andi Buzo, Florian Metze, Igor Szoke, and Mikel Peñagarikano. Quesst2014: evaluating query-by-example speech search in a zero-resource setting with real-life queries. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5833–5837. 2015.

[MPG29]

RV Mises and Hilda Pollaczek-Geiringer. Praktische verfahren der gleichungsauflösung. ZAMM-Journal of Applied Mathematics and Mechanics/Zeitschrift für Angewandte Mathematik und Mechanik, 9(1):58–77, 1929.

[Mys14]

Gautham J Mysore. Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech?—a dataset, insights, and challenges. IEEE Signal Processing Letters, 22(8):1006–1010, 2014.

[NCZ17]

Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612, 2017.

[PCPK15]

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5206–5210. 2015. doi:10.1109/ICASSP.2015.7178964.

[PCZ+19]

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. Specaugment: a simple data augmentation method for automatic speech recognition. Interspeech 2019, Sep 2019. URL: http://dx.doi.org/10.21437/Interspeech.2019-2680, doi:10.21437/interspeech.2019-2680.

[PBS13]

Nathanaël Perraudin, Peter Balazs, and Peter L. Søndergaard. A fast Griffin-Lim algorithm. In 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1–4. 2013. doi:10.1109/WASPAA.2013.6701851.

[PTS+23]

Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. Scaling speech technology to 1,000+ languages. 2023. arXiv:2305.13516.

[PXS+20]

Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. Mls: a large-scale multilingual dataset for speech research. Interspeech 2020, Oct 2020. URL: http://dx.doi.org/10.21437/Interspeech.2020-2826, doi:10.21437/interspeech.2020-2826.

[RLStoter+19]

Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. MUSDB18-HQ - an uncompressed version of musdb18. December 2019. URL: https://doi.org/10.5281/zenodo.3338373, doi:10.5281/zenodo.3338373.

[RGC+20]

Chandan KA Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, and others. The interspeech 2020 deep noise suppression challenge: datasets, subjective testing framework, and challenge results. arXiv preprint arXiv:2005.13981, 2020.

[RDelegliseEsteve12]

Anthony Rousseau, Paul Deléglise, and Yannick Estève. Ted-lium: an automatic speech recognition dedicated corpus. In Conference on Language Resources and Evaluation (LREC), 125–129. 2012.

[SY18]

Seyyed Saeed Sarfjoo and Junichi Yamagishi. Device recorded VCTK (small subset version). 2018.

[SBDokmanic18]

Robin Scheibler, Eric Bezzam, and Ivan Dokmanić. Pyroomacoustics: a python package for audio room simulation and array processing algorithms. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 351–355. IEEE, 2018.

[SPW+18]

Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, and others. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4779–4783. IEEE, 2018.

[SWW+21]

Yangyang Shi, Yongqiang Wang, Chunyang Wu, Ching-Feng Yeh, Julian Chan, Frank Zhang, Duc Le, and Mike Seltzer. Emformer: efficient memory transformer based acoustic model for low latency streaming speech recognition. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6783–6787. 2021.

[SWW+22]

Yangyang Shi, Chunyang Wu, Dilin Wang, Alex Xiao, Jay Mahadeokar, Xiaohui Zhang, Chunxi Liu, Ke Li, Yuan Shangguan, Varun Nagaraja, Ozlem Kalinli, and Mike Seltzer. Streaming transformer transducer based speech recognition using non-causal convolution. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8277–8281. 2022. doi:10.1109/ICASSP43922.2022.9747706.

[Smi20]

Julius O. Smith. Digital audio resampling home page "theory of ideal bandlimited interpolation" section. September 2020. URL: https://ccrma.stanford.edu/~jos/resample/Theory_Ideal_Bandlimited_Interpolation.html.

[SCP15]

David Snyder, Guoguo Chen, and Daniel Povey. MUSAN: A Music, Speech, and Noise Corpus. 2015. arXiv:1510.08484.

[SBA09]

Mehrez Souden, Jacob Benesty, and Sofiene Affes. On optimal frequency-domain multichannel linear filtering for noise reduction. In IEEE Transactions on Audio, Speech, and Language Processing, volume 18, 260–276. IEEE, 2009.

[SWT+22]

Sangeeta Srivastava, Yun Wang, Andros Tjandra, Anurag Kumar, Chunxi Liu, Kritika Singh, and Yatharth Saraf. Conformer-based self-supervised learning for non-speech audio tasks. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8862–8866. 2022. doi:10.1109/ICASSP43922.2022.9746490.

[TEC01]

George Tzanetakis, Georg Essl, and Perry Cook. Automatic musical genre classification of audio signals. 2001. URL: http://ismir2001.ismir.net/pdf/tzanetakis.pdf.

[VAlumae21]

Jörgen Valk and Tanel Alumäe. Voxlingua107: a dataset for spoken language recognition. In 2021 IEEE Spoken Language Technology Workshop (SLT), 652–658. IEEE, 2021.

[WRiviereL+21]

Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, and Emmanuel Dupoux. Voxpopuli: a large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. CoRR, 2021. URL: https://arxiv.org/abs/2101.00390, arXiv:2101.00390.

[Wei98]

R.L. Weide. The carnegie mellon pronouncing dictionary. 1998. URL: http://www.speech.cs.cmu.edu/cgi-bin/cmudict.

[YVM19]

Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald. CSTR VCTK Corpus: english multi-speaker corpus for CSTR voice cloning toolkit (version 0.92). 2019. doi:10.7488/ds/2645.

[YF23]

Chin-Yun Yu and György Fazekas. Singing voice synthesis using differentiable LPC and glottal-flow-inspired wavetables. In Augusto Sarti, Fabio Antonacci, Mark Sandler, Paolo Bestagini, Simon Dixon, Beici Liang, Gaël Richard, and Johan Pauwels, editors, Proceedings of the 24th International Society for Music Information Retrieval Conference, ISMIR 2023, Milan, Italy, November 5-9, 2023, 667–675. 2023. URL: https://doi.org/10.5281/zenodo.10265377, doi:10.5281/ZENODO.10265377.

[ZDC+19]

Heiga Zen, Viet-Trung Dang, Robert A. J. Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Z. Chen, and Yonghui Wu. Libritts: a corpus derived from librispeech for text-to-speech. ArXiv, 2019.

[ZSN21]

Albert Zeyer, Ralf Schlüter, and Hermann Ney. Why does ctc result in peaky behavior? 2021. arXiv:2105.14849.

[BrianMcFeeColinRaffelDawenLiang+15]

Brian McFee, Colin Raffel, Dawen Liang, Daniel P.W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. Librosa: Audio and Music Signal Analysis in Python. In Kathryn Huff and James Bergstra, editors, Proceedings of the 14th Python in Science Conference, 18–24. 2015. doi:10.25080/Majora-7b98e3ed-003.

[KahnRiviereZheng+20]

J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669–7673. 2020. URL: https://github.com/facebookresearch/libri-light.

[Warden18]

P. Warden. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. ArXiv e-prints, April 2018. URL: https://arxiv.org/abs/1804.03209, arXiv:1804.03209.

[Wikipediacontributors]

Wikipedia contributors. Absorption (acoustics) — Wikipedia, the free encyclopedia. [Online]. URL: https://en.wikipedia.org/wiki/Absorption_(acoustics).
