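torchtext.datasets

Warning

The datasets supported by torchtext are datapipes from the torchdata project, which is still in Beta status. This means that the API is subject to change without deprecation cycles. In particular, we expect many of the current idioms to change with the eventual release of DataLoaderV2 from torchdata.

Here are a few recommendations regarding the use of datapipes:

To shuffle the datapipe, do that in the DataLoader: DataLoader(dp, shuffle=True). You do not need to call dp.shuffle(), because torchtext has already done that for you. Note, however, that the datapipe will not be shuffled unless you explicitly pass shuffle=True to the DataLoader.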

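When using multi-processing (num_workers=N), use the builtin worker_init_fn:

from torch.utils.data import DataLoader
from torch.utils.data.backward_compatibility import worker_init_fn

# dp is a datapipe returned by one of the datasets listed on this page
DataLoader(dp, num_workers=4, worker_init_fn=worker_init_fn, drop_last=True)

This will ensure that data isn't duplicated across workers.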

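We also recommend using drop_last=True. Without it, the batch sizes at the end of an epoch may be very small in some cases (smaller than with other map-style datasets). This might affect accuracy greatly, especially when batch norm is used. drop_last=True ensures that all batch sizes are equal.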
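
The effect of drop_last can be illustrated with a small, simplified model. The sketch below uses plain Python and assumes a single process that cuts a fixed number of samples into batches; the helper name batch_sizes is hypothetical and is not part of torch or torchtext.

```python
def batch_sizes(num_samples, batch_size, drop_last):
    """Return the size of each batch in one epoch (single-process model)."""
    full_batches, remainder = divmod(num_samples, batch_size)
    sizes = [batch_size] * full_batches
    # Without drop_last, the leftover samples form a smaller final batch.
    if remainder and not drop_last:
        sizes.append(remainder)
    return sizes


print(batch_sizes(10, 4, drop_last=False))  # [4, 4, 2]
print(batch_sizes(10, 4, drop_last=True))   # [4, 4]
```

With several DataLoader workers, each worker may end its share of the data with its own short batch, which is why the recommendation matters more in that setting.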
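
Distributed training with DistributedDataParallel is not yet entirely stable or supported, and we don't recommend it at this point. It will be better supported in DataLoaderV2. If you still wish to use DDP, make sure that:
- All workers (DDP workers *and* DataLoader workers) see a different part of the data. The datasets are already wrapped inside ShardingFilter, and you may need to call dp.apply_sharding(num_shards, shard_id) in order to shard the data across ranks (DDP workers) and DataLoader workers. One way to do this is to create a worker_init_fn that calls apply_sharding with the appropriate number of shards (DDP workers * DataLoader workers) and shard id (inferred from the rank and the worker ID of the corresponding DataLoader within that rank). Note, however, that this assumes an equal number of DataLoader workers for all ranks.
- All DDP workers work on the same number of batches. One way to do this is to limit the size of the datapipe within each worker to len(datapipe) // num_ddp_workers, but this might not suit all use cases.
- The shuffling seed is the same across all workers. You might need to call torch.utils.data.graph_settings.apply_shuffle_seed(dp, rng).
- The shuffling seed is different across epochs.
- The rest of the RNG (typically used for transformations) is *different* across workers, for maximal entropy and optimal accuracy.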
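
The shard arithmetic described in the list above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not torchtext code: shard_assignment and epoch_seed are hypothetical helper names, the seed formula is just one simple choice, and round-robin slicing of a list stands in for what sharding does to a datapipe.

```python
def shard_assignment(rank, world_size, worker_id, num_workers):
    """Return (num_shards, shard_id) for one DataLoader worker of one DDP rank.

    Assumes every rank runs the same number of DataLoader workers.
    """
    num_shards = world_size * num_workers
    shard_id = rank * num_workers + worker_id
    return num_shards, shard_id


def epoch_seed(base_seed, epoch):
    """Same seed on every worker, different seed for every epoch."""
    return base_seed + epoch


# Hypothetical setup: 2 DDP ranks, 3 DataLoader workers per rank.
world_size, num_workers = 2, 3
data = list(range(12))  # stand-in for the samples of a datapipe

seen = []
for rank in range(world_size):
    for worker_id in range(num_workers):
        num_shards, shard_id = shard_assignment(rank, world_size, worker_id, num_workers)
        # Round-robin slicing as a stand-in for sharding the datapipe.
        seen.extend(data[shard_id::num_shards])

# Every sample is seen exactly once across all workers.
assert sorted(seen) == data
```

In a real worker_init_fn, the two returned values would be the arguments passed to dp.apply_sharding(num_shards, shard_id).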

General use cases are as follows:

# import datasets
from torchtext.datasets import IMDB

train_iter = IMDB(split='train')

def tokenize(label, line):
    return line.split()

tokens = []
for label, line in train_iter:
    tokens += tokenize(label, line)
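
A common next step is to count the collected tokens and assign each one an integer index. The sketch below uses only the standard library; the two (label, line) pairs are hypothetical stand-ins for the items yielded by train_iter, so it runs without downloading IMDB.

```python
from collections import Counter

# Hypothetical (label, line) pairs standing in for the items of train_iter.
samples = [
    (1, "a dull and slow film"),
    (2, "a bright and moving film"),
]


def tokenize(label, line):
    return line.split()


counts = Counter()
for label, line in samples:
    counts.update(tokenize(label, line))

# Map each token to an integer index, most frequent tokens first.
vocab = {token: index for index, (token, _) in enumerate(counts.most_common())}
print(counts["film"])  # 2
```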

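The following datasets are currently available. If you would like to contribute new datasets to the repo or work with your own custom datasets, please refer to the CONTRIBUTING_DATASETS.md guide.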

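Text Classification

AG_NEWS

torchtext.datasets.AG_NEWS(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))

AG_NEWS Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://paperswithcode.com/dataset/ag-news

Number of lines per split:

train: 120000
test: 7600

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields tuple of label (1 to 4) and text

Return type:

(int, str)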
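
Because the labels start at 1, code that expects class indices starting at 0 (for example torch.nn.CrossEntropyLoss targets) needs a shift. The sketch below is a minimal illustration in plain Python; to_zero_based is a hypothetical helper, and the two rows are invented stand-ins for items yielded by the datapipe.

```python
def to_zero_based(samples, num_classes=4):
    """Yield (label - 1, text) so that labels fall in 0..num_classes-1."""
    for label, text in samples:
        if not 1 <= label <= num_classes:
            raise ValueError(f"unexpected label {label}")
        yield label - 1, text


# Hypothetical rows in the (label, text) layout described above.
rows = [(3, "Markets close higher"), (1, "Talks resume next week")]
shifted = list(to_zero_based(rows))
print([label for label, _ in shifted])  # [2, 0]
```

The same helper applies to the other classification datasets on this page whose labels start at 1, with num_classes adjusted to the documented range.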

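AmazonReviewFull

torchtext.datasets.AmazonReviewFull(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))

AmazonReviewFull Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://arxiv.org/abs/1509.01626

Number of lines per split:

train: 3000000
test: 650000

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields tuple of label (1 to 5) and text containing the review title and text

Return type:

(int, str)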

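AmazonReviewPolarity

torchtext.datasets.AmazonReviewPolarity(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))

AmazonReviewPolarity Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://arxiv.org/abs/1509.01626

Number of lines per split:

train: 3600000
test: 400000

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields tuple of label (1 to 2) and text containing the review title and text

Return type:

(int, str)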

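CoLA

torchtext.datasets.CoLA(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'dev', 'test'))

CoLA Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://nyu-mll.github.io/CoLA/

Number of lines per split:

train: 8551
dev: 527
test: 516

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)

Returns:

DataPipe that yields rows from the CoLA dataset (source (str), label (int), sentence (str))

Return type:

(str, int, str)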

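DBpedia

torchtext.datasets.DBpedia(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))

DBpedia Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://www.dbpedia.org/resources/latest-core/

Number of lines per split:

train: 560000
test: 70000

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields tuple of label (1 to 14) and text containing the news title and contents

Return type:

(int, str)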

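IMDB

torchtext.datasets.IMDB(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))

IMDB Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to http://ai.stanford.edu/~amaas/data/sentiment/

Number of lines per split:

train: 25000
test: 25000

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields tuple of label (1 to 2) and text containing the movie review

Return type:

(int, str)

Tutorials using IMDB: T5-Base Model for Summarization, Sentiment Classification, and Translation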

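MNLI

torchtext.datasets.MNLI(root='.data', split=('train', 'dev_matched', 'dev_mismatched'))

MNLI Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://cims.nyu.edu/~sbowman/multinli/

Number of lines per split:

train: 392702
dev_matched: 9815
dev_mismatched: 9832

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev_matched, dev_mismatched)

Returns:

DataPipe that yields tuple of text and label (0 to 2)

Return type:

Tuple[int, str, str]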

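MRPC

torchtext.datasets.MRPC(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))

MRPC Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://www.microsoft.com/en-us/download/details.aspx?id=52398

Number of lines per split:

train: 4076
test: 1725

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields data points from the MRPC dataset, each consisting of label, sentence1, and sentence2

Return type:

(int, str, str)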

QNLI

torchtext.datasets.QNLI(root='.data', split=('train', 'dev', 'test'))

QNLI Dataset

For additional details refer to https://arxiv.org/pdf/1804.07461.pdf (from the GLUE paper)

Number of lines per split:

train: 104743
dev: 5463
test: 5463

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)

Returns:

DataPipe that yields tuple of text and label (0 and 1)

Return type:

(int, str, str)

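QQP

torchtext.datasets.QQP(root: str)

QQP Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')

Returns:

DataPipe that yields rows from the QQP dataset (label (int), question1 (str), question2 (str))

Return type:

(int, str, str)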

RTE

torchtext.datasets.RTE(root='.data', split=('train', 'dev', 'test'))

RTE Dataset

For additional details refer to https://aclweb.org/aclwiki/Recognizing_Textual_Entailment

Number of lines per split:

train: 2490
dev: 277
test: 3000

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)

Returns:

DataPipe that yields tuple of text and/or label (0 and 1). The test split only returns text.

Return type:

Union[(int, str, str), (str, str)]
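
Since the test split yields rows without a label, code that iterates over several splits has to handle both row shapes. The sketch below is one way to normalize them in plain Python; unpack_row is a hypothetical helper and the rows are invented stand-ins for items yielded by the datapipe.

```python
def unpack_row(row):
    """Return (label, sentence1, sentence2); label is None when absent."""
    if len(row) == 3:
        label, sentence1, sentence2 = row
        return label, sentence1, sentence2
    sentence1, sentence2 = row
    return None, sentence1, sentence2


# Hypothetical rows: one labeled (train/dev layout), one unlabeled (test layout).
labeled = (1, "A man is playing a guitar.", "A person plays music.")
unlabeled = ("A dog runs in the park.", "An animal is outside.")

print(unpack_row(labeled)[0])    # 1
print(unpack_row(unlabeled)[0])  # None
```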

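SogouNews

torchtext.datasets.SogouNews(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))

SogouNews Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://arxiv.org/abs/1509.01626

Number of lines per split:

train: 450000
test: 60000

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields tuple of label (1 to 5) and text containing the news title and contents

Return type:

(int, str)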

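SST2

torchtext.datasets.SST2(root='.data', split=('train', 'dev', 'test'))

SST2 Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://nlp.stanford.edu/sentiment/

Number of lines per split:

train: 67349
dev: 872
test: 1821

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)

Returns:

DataPipe that yields tuple of text and/or label (1 to 4). The test split only returns text.

Return type:

Union[(int, str), (str,)]

Tutorials using SST2: SST-2 Binary text classification with XLM-RoBERTa model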

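STSB

torchtext.datasets.STSB(root='.data', split=('train', 'dev', 'test'))

STSB Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark

Number of lines per split:

train: 5749
dev: 1500
test: 1379

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)

Returns:

DataPipe that yields tuple of (index (int), label (float), sentence1 (str), sentence2 (str))

Return type:

(int, float, str, str)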

WNLI

torchtext.datasets.WNLI(root='.data', split=('train', 'dev', 'test'))

WNLI Dataset

For additional details refer to https://arxiv.org/pdf/1804.07461v3.pdf

Number of lines per split:

train: 635
dev: 71
test: 146

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)

Returns:

DataPipe that yields tuple of text and/or label (0 to 1). The test split only returns text.

Return type:

Union[(int, str, str), (str, str)]

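YahooAnswers

torchtext.datasets.YahooAnswers(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))

YahooAnswers Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://arxiv.org/abs/1509.01626

Number of lines per split:

train: 1400000
test: 60000

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields tuple of label (1 to 10) and text containing the question title, question content, and best answer

Return type:

(int, str)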

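YelpReviewFull

torchtext.datasets.YelpReviewFull(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))

YelpReviewFull Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://arxiv.org/abs/1509.01626

Number of lines per split:

train: 650000
test: 50000

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields tuple of label (1 to 5) and text containing the review

Return type:

(int, str)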

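YelpReviewPolarity

torchtext.datasets.YelpReviewPolarity(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))

YelpReviewPolarity Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://arxiv.org/abs/1509.01626

Number of lines per split:

train: 560000
test: 38000

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields tuple of label (1 to 2) and text containing the review

Return type:

(int, str)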

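Language Modeling

Penn Treebank

torchtext.datasets.PennTreebank(root='.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))

PennTreebank Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://catalog.ldc.upenn.edu/docs/LDC95T7/cl93.html

Number of lines per split:

train: 42068
valid: 3370
test: 3761

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'valid', 'test')

Returns:

DataPipe that yields text from the Treebank corpus

Return type:

str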

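WikiText-2

torchtext.datasets.WikiText2(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))

WikiText2 Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

Number of lines per split:

train: 36718
valid: 3760
test: 4358

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'valid', 'test')

Returns:

DataPipe that yields text from Wikipedia articles

Return type:

str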

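WikiText103

torchtext.datasets.WikiText103(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))

WikiText103 Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

Number of lines per split:

train: 1801350
valid: 3760
test: 4358

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'valid', 'test')

Returns:

DataPipe that yields text from Wikipedia articles

Return type:

str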

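Machine Translation

IWSLT2016

torchtext.datasets.IWSLT2016(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'), valid_set='tst2013', test_set='tst2014')

IWSLT2016 Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://wit3.fbk.eu/2016-01

The available datasets include the following:

Language pairs:

        "en"  "fr"  "de"  "cs"  "ar"
"en"          x     x     x     x
"fr"    x
"de"    x
"cs"    x
"ar"    x

valid/test sets: ["dev2010", "tst2010", "tst2011", "tst2012", "tst2013", "tst2014"]

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'valid', 'test')
language_pair – tuple or list containing the source and target language
valid_set – a string identifying the validation set.
test_set – a string identifying the test set.

Returns:

DataPipe that yields tuple of source and target sentences

Return type:

(str, str)

Examples

>>> from torchtext.datasets import IWSLT2016
>>> train_iter, valid_iter, test_iter = IWSLT2016()
>>> src_sentence, tgt_sentence = next(iter(train_iter))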
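
The language pair table and the list of valid/test sets for IWSLT2016 can be turned into a small pre-flight check before the dataset is constructed. The sketch below is plain Python and does not call torchtext; is_supported_pair and check_arguments are hypothetical helpers, and their data is copied from the table and list in this section.

```python
# Taken from the IWSLT2016 language pair table: "en" pairs with each of these.
PARTNERS_OF_EN = ("fr", "de", "cs", "ar")
VALID_TEST_SETS = ("dev2010", "tst2010", "tst2011", "tst2012", "tst2013", "tst2014")


def is_supported_pair(language_pair):
    """True if the (source, target) pair is marked with an x in the table."""
    src, tgt = language_pair
    return (src == "en" and tgt in PARTNERS_OF_EN) or (
        tgt == "en" and src in PARTNERS_OF_EN
    )


def check_arguments(language_pair=("de", "en"), valid_set="tst2013", test_set="tst2014"):
    """Raise ValueError for combinations not listed in this section."""
    if not is_supported_pair(language_pair):
        raise ValueError(f"unsupported language pair: {language_pair}")
    for name in (valid_set, test_set):
        if name not in VALID_TEST_SETS:
            raise ValueError(f"unknown valid/test set: {name}")


check_arguments()                       # the documented defaults pass
print(is_supported_pair(("fr", "de")))  # False
```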

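IWSLT2017

torchtext.datasets.IWSLT2017(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'))

IWSLT2017 Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://wit3.fbk.eu/2017-01

The available datasets include the following:

Language pairs:

        "en"  "nl"  "de"  "it"  "ro"
"en"          x     x     x     x
"nl"    x           x     x     x
"de"    x     x           x     x
"it"    x     x     x           x
"ro"    x     x     x     x

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'valid', 'test')
language_pair – tuple or list containing the source and target language

Returns:

DataPipe that yields tuple of source and target sentences

Return type:

(str, str)

Examples

>>> from torchtext.datasets import IWSLT2017
>>> train_iter, valid_iter, test_iter = IWSLT2017()
>>> src_sentence, tgt_sentence = next(iter(train_iter))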

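Multi30k

torchtext.datasets.Multi30k(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'), language_pair: Tuple[str] = ('de', 'en'))

Multi30k Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://www.statmt.org/wmt16/multimodal-task.html#task1

Number of lines per split:

train: 29000
valid: 1014
test: 1000

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'valid', 'test')
language_pair – tuple or list containing the source and target language. Available options are ('de', 'en') and ('en', 'de')

Returns:

DataPipe that yields tuple of source and target sentences

Return type:

(str, str)

Tutorials using Multi30k: T5-Base Model for Summarization, Sentiment Classification, and Translation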

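Sequence Tagging

CoNLL2000Chunking

torchtext.datasets.CoNLL2000Chunking(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))

CoNLL2000Chunking Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://www.clips.uantwerpen.be/conll2000/chunking/

Number of lines per split:

train: 8936
test: 2012

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields a list of words along with the corresponding part-of-speech tags and chunk tags

Return type:

[list(str), list(str), list(str)]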
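
Each item therefore holds three parallel lists. The sketch below shows one way to regroup them into per-token rows in plain Python; to_token_rows is a hypothetical helper, and the sample item is an illustrative stand-in rather than a row read from the dataset.

```python
def to_token_rows(words, pos_tags, chunk_tags):
    """Zip three parallel lists into (word, pos_tag, chunk_tag) rows."""
    if not (len(words) == len(pos_tags) == len(chunk_tags)):
        raise ValueError("the three lists must have the same length")
    return list(zip(words, pos_tags, chunk_tags))


# Illustrative item in the [words, pos_tags, chunk_tags] layout described above.
item = [
    ["He", "reckons", "the", "deficit"],
    ["PRP", "VBZ", "DT", "NN"],
    ["B-NP", "B-VP", "B-NP", "I-NP"],
]

for row in to_token_rows(*item):
    print(row)
```

The same helper pattern works for UDPOS items, which carry two parallel lists instead of three.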

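UDPOS

torchtext.datasets.UDPOS(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))

UDPOS Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

Number of lines per split:

train: 12543
valid: 2002
test: 2077

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'valid', 'test')

Returns:

DataPipe that yields a list of words along with the corresponding part-of-speech tags

Return type:

[list(str), list(str)]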

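Question Answering

SQuAD 1.0

torchtext.datasets.SQuAD1(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'dev'))

SQuAD1 Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://rajpurkar.github.io/SQuAD-explorer/

Number of lines per split:

train: 87599
dev: 10570

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev)

Returns:

DataPipe that yields data points from the SQuAD1 dataset, each consisting of context, question, list of answers, and the corresponding indices in the context

Return type:

(str, str, list(str), list(int))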
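
The list of indices can be combined with the answers to recover answer spans in the context. The sketch below assumes that each index is the character offset at which the answer starts, as in the SQuAD answer_start field; answer_spans is a hypothetical helper and the data point is invented for illustration.

```python
def answer_spans(context, answers, answer_starts):
    """Return (start, end) character spans, checked against the context."""
    spans = []
    for answer, start in zip(answers, answer_starts):
        end = start + len(answer)
        if context[start:end] != answer:
            raise ValueError(f"answer {answer!r} not found at offset {start}")
        spans.append((start, end))
    return spans


# Hypothetical data point in the (context, question, answers, indices) layout.
context = "The quick brown fox jumps over the lazy dog."
question = "What does the fox jump over?"
answers = ["the lazy dog"]
answer_starts = [context.index("the lazy dog")]

print(answer_spans(context, answers, answer_starts))  # [(31, 43)]
```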

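SQuAD 2.0

torchtext.datasets.SQuAD2(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'dev'))

SQuAD2 Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://rajpurkar.github.io/SQuAD-explorer/

Number of lines per split:

train: 130319
dev: 11873

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev)

Returns:

DataPipe that yields data points from the SQuAD2 dataset, each consisting of context, question, list of answers, and the corresponding indices in the context

Return type:

(str, str, list(str), list(int))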

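Unsupervised Learning

CC100

torchtext.datasets.CC100(root: str, language_code: str = 'en')

CC100 Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to https://data.statmt.org/cc-100/

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
language_code – the language of the dataset

Returns:

DataPipe that yields tuple of language code and text

Return type:

(str, str)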

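EnWik9

torchtext.datasets.EnWik9(root: str)

EnWik9 Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see the note at the top of this page for further instructions.

For additional details refer to http://mattmahoney.net/dc/textdata.html

Number of lines in dataset: 13147026

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')

Returns:

DataPipe that yields raw text rows from the EnWik9 dataset

Return type:

str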
