Building and Running Llama 3 8B Instruct with Qualcomm AI Engine Direct Backend

This tutorial demonstrates how to export Llama 3 8B Instruct for the Qualcomm AI Engine Direct Backend and run the model on a Qualcomm device.

Prerequisites

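If you have not done so already, set up your ExecuTorch repo and development environment by following Setting up ExecuTorch.
Read the Building and Running ExecuTorch with Qualcomm AI Engine Direct Backend page to understand how to export and run a model with the Qualcomm AI Engine Direct Backend on a Qualcomm device.
Follow the README for executorch llama to learn how to run a llama model on mobile via ExecuTorch.
A Qualcomm device with 16GB RAM
- We are continuing to optimize memory usage to ensure compatibility with lower-memory devices.
Qualcomm AI Engine Direct SDK version 2.26.0 or above.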
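The build and push commands later in this tutorial read the `QNN_SDK_ROOT` and `ANDROID_NDK_ROOT` environment variables. A minimal sketch of a check that both are set before you start is shown here; `check_env_vars` is a hypothetical helper name and is not part of ExecuTorch or the Qualcomm SDK.

```shell
# Hypothetical helper: report which of the named environment variables are unset or empty.
check_env_vars() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Both variables are referenced by the cmake and adb commands in Step 3.
check_env_vars QNN_SDK_ROOT ANDROID_NDK_ROOT || echo "Set the variables listed above before continuing."
```

The check only confirms that the variables are non-empty; it does not validate the SDK version or the paths they point to.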

Instructions

Step 1: Prepare the model checkpoint and the optimized matrix from Spin Quant

For the Llama 3 tokenizer and checkpoint, refer to https://github.com/meta-llama/llama-models/blob/main/README.md for instructions on how to download tokenizer.model, consolidated.00.pth, and params.json.
To get the optimized matrix, refer to SpinQuant on GitHub. You can download the optimized rotation matrices in the Quantized Models section. Choose LLaMA-3-8B/8B_W4A16KV16_lr_1.5_seed_0.

Step 2: Export to ExecuTorch with Qualcomm AI Engine Direct Backend

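Deploying a large language model such as Llama 3 on-device presents the following challenges:

The model is too large to fit in device memory for inference.
Model loading and inference take too long.
Quantization is difficult.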

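To address these challenges, we have implemented the following solutions:

Use --pt2e_quantize qnn_16a4w to quantize activations and weights, which reduces the on-disk model size and alleviates memory pressure during inference.
Use --num_sharding 8 to split the model into sub-parts.
Perform graph transformations to convert or decompose operations into more accelerator-friendly operations.
Use --optimized_rotation_path <path_to_optimized_matrix> to apply R1 and R2 of Spin Quant to improve accuracy.
Use --calibration_data "<|start_header_id|>system<|end_header_id|..." to ensure that, when quantizing Llama 3 8B instruct, the calibration includes the special tokens in the prompt template. For more details on the prompt template, refer to the model card of meta llama3 instruct.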
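To see roughly why weight quantization matters for a 16GB device, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate under two assumptions that are not stated in this tutorial: the model has about 8 billion parameters (taken from the "8B" in its name), and `qnn_16a4w` stores weights in 4 bits, as the name suggests. It ignores quantization scales and offsets, activations, the KV cache, and any layers kept at higher precision. `weight_size_gb` is a hypothetical helper name.

```shell
# Rough weight-storage estimate: parameters * bits_per_weight / 8 bytes.
weight_size_gb() {
    params_billion=$1   # parameter count, in billions (assumed: 8)
    bits=$2             # bits per weight
    # billions of parameters * bits / 8 = billions of bytes, i.e. roughly GB (10^9 bytes)
    echo $(( params_billion * bits / 8 ))
}

echo "16-bit weights: ~$(weight_size_gb 8 16) GB"
echo "4-bit weights:  ~$(weight_size_gb 8 4) GB"
```

Under these assumptions, 16-bit weights alone would already fill a 16GB device, while 4-bit weights leave headroom for everything else the runtime needs.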

To export Llama 3 8B instruct with the Qualcomm AI Engine Direct Backend, ensure the following:

The host machine has more than 100GB of memory (RAM + swap space).
The entire process takes a few hours.
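On a Linux host, the combined RAM and swap figure can be read from `/proc/meminfo`. The following is a minimal sketch, assuming a Linux host and the standard `MemTotal` and `SwapTotal` fields reported in kB; `total_mem_gib` is a hypothetical helper name.

```shell
# Sum MemTotal and SwapTotal (reported in kB) from a meminfo-style file and print whole GiB.
total_mem_gib() {
    meminfo_file=${1:-/proc/meminfo}
    awk '/^(MemTotal|SwapTotal):/ { kb += $2 } END { printf "%d\n", kb / 1048576 }' "$meminfo_file"
}

# Only attempt the check where /proc/meminfo exists (Linux).
if [ -r /proc/meminfo ]; then
    echo "RAM + swap: $(total_mem_gib) GiB (this tutorial asks for more than 100GB)"
fi
```

The result is truncated to whole GiB, so treat a value close to the threshold as borderline.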

# Please note that calibration_data must include the prompt template for special tokens.
python -m examples.models.llama.export_llama \
    -t <path_to_tokenizer.model> -p <path_to_params.json> -c <path_to_checkpoint_for_Meta-Llama-3-8B-Instruct> \
    --use_kv_cache --qnn --pt2e_quantize qnn_16a4w --disable_dynamic_shape --num_sharding 8 \
    --calibration_tasks wikitext --calibration_limit 1 --calibration_seq_length 128 --optimized_rotation_path <path_to_optimized_matrix> \
    --calibration_data "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
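The calibration string above follows a fixed pattern: a system message, a user message, and an open assistant header, each wrapped in the special tokens. A small sketch of a helper that assembles that string from the two messages is shown here; `build_llama3_prompt` is a hypothetical name, and the token layout is copied from the command above rather than from the model card.

```shell
# Hypothetical helper: wrap a system and a user message in the header tokens used above.
# The "\n" sequences are kept as literal backslash-n characters, exactly as the
# double-quoted strings in this tutorial's commands pass them.
build_llama3_prompt() {
    system_msg=$1
    user_msg=$2
    printf '%s%s%s%s%s' \
        '<|start_header_id|>system<|end_header_id|>\n\n' \
        "$system_msg" \
        '<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n' \
        "$user_msg" \
        '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
}

build_llama3_prompt "You are a funny chatbot." "Could you tell me about Facebook?"
```

The output can be passed as the value of --calibration_data, which keeps the calibration prompt and the prompt used at run time in the same format.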

Step 3: Invoke the Runtime on an Android smartphone with Qualcomm SoCs

Build executorch with Qualcomm AI Engine Direct Backend for Android

cmake \
    -DCMAKE_TOOLCHAIN_FILE="${ANDROID_NDK_ROOT}/build/cmake/android.toolchain.cmake" \
    -DANDROID_ABI=arm64-v8a \
    -DCMAKE_INSTALL_PREFIX=cmake-android-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
    -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
    -DEXECUTORCH_BUILD_QNN=ON \
    -DQNN_SDK_ROOT=${QNN_SDK_ROOT} \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -Bcmake-android-out .

cmake --build cmake-android-out -j16 --target install --config Release

Build llama runner for Android

    cmake \
        -DCMAKE_TOOLCHAIN_FILE="${ANDROID_NDK_ROOT}"/build/cmake/android.toolchain.cmake  \
        -DANDROID_ABI=arm64-v8a \
        -DCMAKE_INSTALL_PREFIX=cmake-android-out \
        -DCMAKE_BUILD_TYPE=Release -DPYTHON_EXECUTABLE=python \
        -DEXECUTORCH_BUILD_QNN=ON \
        -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
        -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
        -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
        -Bcmake-android-out/examples/models/llama examples/models/llama

    cmake --build cmake-android-out/examples/models/llama -j16 --config Release

Run on Android via adb shell. Prerequisite: make sure USB debugging is enabled in the developer options on your phone.

3.1 Connect your Android phone

3.2 Push the required QNN libraries to the device.

# Make sure you have write permission on the path below.
DEVICE_DIR=/data/local/tmp/llama
adb shell mkdir -p ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtp.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnSystem.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV69Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV73Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV75Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v69/unsigned/libQnnHtpV69Skel.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v73/unsigned/libQnnHtpV73Skel.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so ${DEVICE_DIR}
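The commands above push the Stub and Skel libraries for three HTP versions (V69, V73, V75). If you already know which version your SoC uses, a sketch like the following lists only the files for that version. `qnn_libs_for` is a hypothetical helper; the paths are the ones used in the commands above, and which version matches a given device is something to confirm in Qualcomm's documentation.

```shell
# Hypothetical helper: print the SDK-relative paths of the QNN libraries for one HTP version.
qnn_libs_for() {
    ver=$1   # e.g. 69, 73, or 75, matching the versions pushed above
    printf '%s\n' \
        "lib/aarch64-android/libQnnHtp.so" \
        "lib/aarch64-android/libQnnSystem.so" \
        "lib/aarch64-android/libQnnHtpV${ver}Stub.so" \
        "lib/hexagon-v${ver}/unsigned/libQnnHtpV${ver}Skel.so"
}

# Usage sketch (requires adb, QNN_SDK_ROOT, and DEVICE_DIR as set above):
# for lib in $(qnn_libs_for 75); do adb push "${QNN_SDK_ROOT}/${lib}" "${DEVICE_DIR}"; done
qnn_libs_for 75
```

Pushing all three versions, as the tutorial does, is the simpler choice when the device's HTP version is unknown.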

3.3 Upload the model, tokenizer, and llama runner binary to the phone

adb push <model.pte> ${DEVICE_DIR}
adb push <tokenizer.model> ${DEVICE_DIR}
adb push cmake-android-out/lib/libqnn_executorch_backend.so ${DEVICE_DIR}
adb push cmake-android-out/examples/models/llama/llama_main ${DEVICE_DIR}

3.4 Run the model

adb shell "cd ${DEVICE_DIR} && ./llama_main --model_path <model.pte> --tokenizer_path <tokenizer.model> --prompt \"<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n\" --seq_len 128"

You should see output similar to the following:

<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHello! I'd be delighted to chat with you about Facebook. Facebook is a social media platform that was created in 2004 by Mark Zuckerberg and his colleagues while he was a student at Harvard University. It was initially called "Facemaker" but later changed to Facebook, which is a combination of the words "face" and "book". The platform was initially intended for people to share their thoughts and share information with their friends, but it quickly grew to become one of the

What is coming?

Improve the performance of Llama 3 Instruct
Reduce memory pressure during inference to support 12GB Qualcomm devices
Support more LLMs

FAQ

If you encounter any issues while reproducing this tutorial, please file a GitHub issue on the ExecuTorch repo and tag it with #qcom_aisw.