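Inference API¶

By default, the Inference API listens on port 8080 and is accessible only from localhost. To change the default setting, see TorchServe Configuration.

Every Inference API request must include the correct inference token unless token authorization is disabled. For more details, see the token authorization documentation.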
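
As a rough sketch of what "including the token" can look like from Python, the snippet below attaches a token to a request using only the standard library. The bearer-style `Authorization` header, the helper name `build_inference_request`, and the placeholder token value are assumptions made for illustration; the token authorization documentation remains the authoritative description of the format.

```python
import urllib.request


def build_inference_request(url, payload, token=None):
    """Build a POST request, optionally carrying an inference token.

    Assumption: the token is sent as "Authorization: Bearer <token>".
    """
    request = urllib.request.Request(url, data=payload, method="POST")
    if token is not None:
        request.add_header("Authorization", f"Bearer {token}")
    return request


if __name__ == "__main__":
    # "<inference-token>" is a placeholder, not a real token.
    req = build_inference_request(
        "http://127.0.0.1:8080/predictions/resnet-18",
        b"example payload",
        token="<inference-token>",
    )
    print(req.get_header("Authorization"))
```

The request object is only built here, not sent, so the sketch can be inspected without a running server.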

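The TorchServe server supports the following APIs:

API Description - Gets a list of available APIs and options
Health check API - Gets the health status of the running server
Predictions API - Gets predictions from the served model
Explanations API - Gets explanations from the served model
KServe Inference API - Gets predictions of the served model from KServe
KServe Explanations API - Gets explanations of the served model from KServe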

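API Description¶

To view the full list of inference APIs, use the following command:

curl -X OPTIONS http://127.0.0.1:8080

The output is in OpenAPI 3.0.1 JSON format. You can use it to generate client code; see swagger codegen for more details.

Inference API description output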
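
Because the description is an OpenAPI document, its `paths` object can be walked to list the available endpoints. The following is a minimal sketch under that assumption; the helper names `list_api_paths` and `fetch_api_description` are hypothetical and not part of TorchServe.

```python
import json
import urllib.request

HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def list_api_paths(openapi_doc):
    """Return sorted (path, METHOD) pairs found in a parsed OpenAPI document."""
    pairs = []
    for path, item in openapi_doc.get("paths", {}).items():
        for key in item:
            # Path items may also hold non-method keys such as "parameters".
            if key.lower() in HTTP_METHODS:
                pairs.append((path, key.upper()))
    return sorted(pairs)


def fetch_api_description(base_url="http://127.0.0.1:8080"):
    """Send an OPTIONS request to the server root and parse the JSON reply."""
    request = urllib.request.Request(base_url, method="OPTIONS")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires a running server at the default address.
    for path, method in list_api_paths(fetch_api_description()):
        print(method, path)
```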

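Health check API¶

This API follows the InferenceAPIsService.Ping gRPC API. It returns the status of a model in the ModelServer.

TorchServe supports a ping API that you can call to check the health status of a running TorchServe server:

curl http://127.0.0.1:8080/ping

If the server is running, the response is: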

{
  "status": "Healthy"
}

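"maxRetryTimeoutInSec" (default: 5 minutes) can be defined in a model's config YAML file (for example, model-config.yaml). It is the maximum time window for recovering a dead backend worker. Within the maxRetryTimeoutInSec window, a healthy worker can be in one of the following states: WORKER_STARTED, WORKER_MODEL_LOADED, or WORKER_STOPPED. The "Ping" endpoint:

returns 200 + JSON message "healthy": for every model, the number of active workers is equal to or greater than the configured minWorkers.
returns 500 + JSON message "unhealthy": for at least one model, the number of active workers is less than the configured minWorkers.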
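
The two outcomes above can be told apart by status code. Below is a minimal sketch of a client-side check, assuming the reply body is a JSON object with a `status` field as in the example response; the helper names `interpret_ping` and `ping` are hypothetical.

```python
import json
import urllib.error
import urllib.request


def interpret_ping(status_code, body):
    """Map a ping reply to (is_healthy, status_text)."""
    try:
        status_text = json.loads(body).get("status", "")
    except (ValueError, AttributeError):
        # Body was not JSON, or was JSON but not an object.
        status_text = ""
    return status_code == 200, status_text


def ping(base_url="http://127.0.0.1:8080"):
    """Call /ping and interpret the reply, including the HTTP 500 case."""
    try:
        with urllib.request.urlopen(base_url + "/ping") as response:
            return interpret_ping(response.status, response.read())
    except urllib.error.HTTPError as error:
        # urllib raises HTTPError for non-2xx codes such as 500.
        return interpret_ping(error.code, error.read())


if __name__ == "__main__":
    # Requires a running server at the default address.
    print(ping())
```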

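Predictions API¶

This API follows the InferenceAPIsService.Predictions gRPC API. It returns predictions from a model in the ModelServer.

To get predictions from the default version of each loaded model, make a REST call to /predictions/{model_name}:

POST /predictions/{model_name}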

curl Example¶

curl -O https://raw.githubusercontent.com/pytorch/serve/master/docs/images/kitten_small.jpg

curl http://127.0.0.1:8080/predictions/resnet-18 -T kitten_small.jpg

or:

curl http://127.0.0.1:8080/predictions/resnet-18 -F "data=@kitten_small.jpg"

To get predictions from a loaded model that expects multiple inputs:

curl http://127.0.0.1:8080/predictions/squeezenet1_1 -F 'data=@docs/images/dogs-before.jpg' -F 'data=@docs/images/kitten_small.jpg'

or:

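import requests

# A dict cannot hold two 'data' keys (the second overwrites the first), so pass a list of (field, file) tuples.
with open('docs/images/dogs-before.jpg', 'rb') as dogs, open('docs/images/kitten_small.jpg', 'rb') as kitten:
    res = requests.post("http://127.0.0.1:8080/predictions/squeezenet1_1", files=[('data', dogs), ('data', kitten)])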

To get predictions from a specific version of each loaded model, make a REST call to /predictions/{model_name}/{version}:

POST /predictions/{model_name}/{version}

curl Example¶

curl -O https://raw.githubusercontent.com/pytorch/serve/master/docs/images/kitten_small.jpg

curl http://127.0.0.1:8080/predictions/resnet-18/2.0 -T kitten_small.jpg

or:

curl http://127.0.0.1:8080/predictions/resnet-18/2.0 -F "data=@kitten_small.jpg"

The result is JSON telling you that the image is most likely a tabby cat. The highest prediction was:

{
    "class": "n02123045 tabby, tabby cat",
    "probability": 0.42514491081237793
}
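
The default-version and versioned endpoints differ only in the trailing path segment, so a client can build both from one helper. The sketch below uses only the standard library; the function names `predictions_url` and `predict_from_file` and the `application/octet-stream` content type are assumptions for illustration, not something this page specifies.

```python
import urllib.parse
import urllib.request


def predictions_url(model_name, version=None, base_url="http://127.0.0.1:8080"):
    """Return /predictions/{model_name} or /predictions/{model_name}/{version}."""
    parts = ["predictions", model_name]
    if version is not None:
        parts.append(str(version))
    path = "/".join(urllib.parse.quote(part, safe="") for part in parts)
    return f"{base_url.rstrip('/')}/{path}"


def predict_from_file(image_path, model_name, version=None):
    """POST the raw file bytes to the prediction endpoint and return the reply text."""
    with open(image_path, "rb") as image_file:
        payload = image_file.read()
    request = urllib.request.Request(
        predictions_url(model_name, version),
        data=payload,
        method="POST",
        # Assumption: send the image as an opaque binary body.
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(predictions_url("resnet-18"))
    print(predictions_url("resnet-18", "2.0"))
```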

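Streaming response via HTTP 1.1 chunked encoding

The TorchServe Inference API supports streaming responses, which allow a sequence of inference responses to be sent over HTTP 1.1 chunked encoding. This feature is recommended only for use cases where the inference latency of the full response is high and intermediate results are sent to the client. One example is an LLM used for generative applications, where generating "n" tokens can have high latency; in that case the user can receive each generated token as soon as it is ready, until the full response completes.

To produce a streaming response, the backend handler calls "send_intermediate_predict_response" to send one intermediate result to the frontend, and returns the last result in the existing style. For example: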

from ts.handler_utils.utils import send_intermediate_predict_response

# Note: TorchServe v1.0.0 will deprecate
# "from ts.protocol.otf_message_handler import send_intermediate_predict_response".
# Please replace it with "from ts.handler_utils.utils import send_intermediate_predict_response".

def handle(data, context):
    if isinstance(data, list):
        # Send three intermediate results to the frontend before the final one.
        for _ in range(3):
            send_intermediate_predict_response(["intermediate_response"], context.request_ids, "Intermediate Prediction success", 200, context)
        # The last result is returned in the usual way.
        return ["hello world "]

The client side receives the chunked data.

def test_echo_stream_inference():
    test_utils.start_torchserve(no_config_snapshots=True, gen_mar=False)
    test_utils.register_model('echo_stream',
                              'https://torchserve.pytorch.org/mar_files/echo_stream.mar')

    response = requests.post(TF_INFERENCE_API + '/predictions/echo_stream', data="foo", stream=True)
    assert response.headers['Transfer-Encoding'] == 'chunked'

    prediction = []
    for chunk in (response.iter_content(chunk_size=None)):
        if chunk:
            prediction.append(chunk.decode("utf-8"))

    assert str(" ".join(prediction)) == "hello hello hello hello world "
    test_utils.unregister_model('echo_stream')
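
The test above depends on TorchServe's own test helpers. To illustrate only the client-side mechanics of reading a chunked reply piece by piece, here is a self-contained sketch using the standard library. The stub handler is a stand-in that emits three fixed chunks; it is not TorchServe, and all names in the block are hypothetical.

```python
import http.client
import http.server
import threading


class _StubStreamHandler(http.server.BaseHTTPRequestHandler):
    """Stand-in for a streaming model server; NOT TorchServe."""

    protocol_version = "HTTP/1.1"  # chunked encoding requires HTTP/1.1

    def do_POST(self):
        # Consume the request body, then stream three fixed chunks.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        self.send_response(200)
        self.send_header("Transfer-Encoding", "chunked")
        self.send_header("Connection", "close")
        self.end_headers()
        for part in (b"hello ", b"hello ", b"world "):
            # Chunk framing: hex length, CRLF, payload, CRLF.
            self.wfile.write(b"%x\r\n%s\r\n" % (len(part), part))
            self.wfile.flush()
        self.wfile.write(b"0\r\n\r\n")  # zero-length chunk ends the stream

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def read_streamed_prediction(host, port, path, payload):
    """POST payload and return the pieces of the reply as they arrive."""
    connection = http.client.HTTPConnection(host, port, timeout=10)
    try:
        connection.request("POST", path, body=payload)
        response = connection.getresponse()
        pieces = []
        while True:
            piece = response.read1(65536)  # returns b"" once the stream ends
            if not piece:
                break
            pieces.append(piece.decode("utf-8"))
        return pieces
    finally:
        connection.close()


def run_demo():
    """Serve one request from the stub on a free local port and read it back."""
    server = http.server.HTTPServer(("127.0.0.1", 0), _StubStreamHandler)
    worker = threading.Thread(target=server.handle_request, daemon=True)
    worker.start()
    try:
        port = server.server_address[1]
        return read_streamed_prediction("127.0.0.1", port, "/predictions/echo_stream", b"foo")
    finally:
        worker.join(timeout=5)
        server.server_close()


if __name__ == "__main__":
    print("".join(run_demo()))
```

Against a real server, only `read_streamed_prediction` would be needed, pointed at the inference address and a model that streams intermediate results.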

Explanations API¶

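TorchServe uses Captum's functionality to return explanations of the served models.

To get explanations from the default version of each loaded model, make a REST call to /explanations/{model_name}:

POST /explanations/{model_name}

curl Example¶

curl http://127.0.0.1:8080/explanations/mnist -T examples/image_classifier/mnist/test_data/0.png

The result is JSON that gives you the explanations for the input image: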

  [
    [
      [
        [
          0.004570948731989492,
          0.006216969640322402,
          0.008197565423679522,
          0.009563574612830427,
          0.008999274832810742,
          0.009673474804303854,
          0.007599905146155397,
          ,
	        ,

        ]
      ]
    ]
  ]
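
The output above is truncated, so its full dimensions are not visible. Assuming the complete response is a regularly nested JSON array, as the excerpt suggests, a small helper can report its shape once the response has been parsed. The name `nested_shape` is hypothetical.

```python
def nested_shape(values):
    """Return the dimensions of a regularly nested list, outermost first.

    Only the first element at each level is inspected, so the result is
    meaningful only when every sub-list at a level has the same length.
    """
    shape = []
    while isinstance(values, list):
        shape.append(len(values))
        if not values:
            break
        values = values[0]
    return tuple(shape)


if __name__ == "__main__":
    # Tiny made-up array standing in for json.loads(response_text).
    print(nested_shape([[[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]]]))
```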

KServe Inference API¶

TorchServe uses the KServe Inference API to return predictions of the served models.

To get predictions from a loaded model, make a REST call to /v1/models/{model_name}:predict:

POST /v1/models/{model_name}:predict

curl Example¶

curl -H "Content-Type: application/json" --data @kubernetes/kserve/kf_request_json/v1/mnist.json http://127.0.0.1:8080/v1/models/mnist:predict

The result is JSON that gives you the predictions for the input JSON:

{
  "predictions": [
    2
  ]
}
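
For readers building the request body themselves rather than using the bundled mnist.json, the sketch below shows one plausible shape: an `instances` list on the way in and the `predictions` list shown above on the way out. The `data` key holding a base64-encoded image is an assumption about the request layout, and both function names are hypothetical.

```python
import base64
import json


def build_v1_request(image_bytes):
    """Wrap raw image bytes in an {"instances": [...]} JSON body.

    Assumption: each instance carries the image as base64 text under "data".
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"instances": [{"data": encoded}]})


def extract_predictions(response_text):
    """Return the "predictions" list from a predict response body."""
    return json.loads(response_text)["predictions"]


if __name__ == "__main__":
    print(build_v1_request(b"abc"))
    print(extract_predictions('{"predictions": [2]}'))
```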

KServe Explanations API¶

TorchServe uses the KServe API spec to return explanations of the served models.

To get explanations from a loaded model, make a REST call to /v1/models/{model_name}:explain:

POST /v1/models/{model_name}:explain

curl Example¶

curl -H "Content-Type: application/json" --data @kubernetes/kserve/kf_request_json/v1/mnist.json http://127.0.0.1:8080/v1/models/mnist:explain

The result is JSON that gives you the explanations for the input JSON:

{
  "explanations": [
    [
      [
        [
          0.004570948731989492,
          0.006216969640322402,
          0.008197565423679522,
          0.009563574612830427,
          0.008999274832810742,
          0.009673474804303854,
          0.007599905146155397,
          ,
          ,
	        ,
        ]
      ]
    ]
  ]
}
