Using GGUF with llama.cpp

Tip

You can now deploy any llama.cpp-compatible GGUF on Hugging Face Endpoints. See the Hugging Face Endpoints documentation for more information.

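llama.cpp lets you download a GGUF and run inference on it by providing a Hugging Face repo path and file name. It downloads the model checkpoint and caches it automatically. The cache location is set by the LLAMA_CACHE environment variable; see the llama.cpp documentation for details.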
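As a minimal sketch of the cache setting described above, the snippet below points LLAMA_CACHE at a custom directory. The directory name is an assumption chosen for illustration; any writable path should work.

```shell
# Hypothetical cache directory; replace with any writable path.
export LLAMA_CACHE="$HOME/llama-cache"
mkdir -p "$LLAMA_CACHE"

# llama.cpp commands started from this shell session should now cache
# downloaded checkpoints under that directory.
echo "llama.cpp cache directory: $LLAMA_CACHE"
```

Because the variable is exported, it applies only to the current shell session and its child processes; add the export line to your shell profile to make it persistent.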

You can install llama.cpp with brew (works on Mac and Linux) or build it from source. Pre-built binaries and Docker images are also available; see the official documentation.

Option 1: Install with brew/winget

brew install llama.cpp

or, on Windows, via winget

winget install llama.cpp

Option 2: Build from source

Step 1: Clone llama.cpp from GitHub.

git clone https://github.com/ggerganov/llama.cpp

Step 2: Move into the llama.cpp folder and build it. You can also add hardware-specific flags (for example, -DGGML_CUDA=ON for Nvidia GPUs).

cd llama.cpp
cmake -B build   # optionally, add -DGGML_CUDA=ON to enable CUDA
cmake --build build --config Release

Note: for other hardware support (for example, AMD ROCm or Intel SYCL), refer to llama.cpp's build guide.
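After a source build, the binaries are not on your PATH automatically. The sketch below assumes a Linux or macOS shell, the default build directory named build, and that the commands are run from the llama.cpp checkout, where the build places executables under build/bin. The BIN_DIR variable is introduced here only for illustration.

```shell
# Run from the llama.cpp checkout; assumes the default "build" directory.
BIN_DIR="$PWD/build/bin"

if [ -x "$BIN_DIR/llama-cli" ]; then
  # Make llama-cli and llama-server callable by name in this session.
  export PATH="$BIN_DIR:$PATH"
  llama-cli --version
else
  echo "llama-cli not found in $BIN_DIR; check the build output" >&2
fi
```

With the directory on PATH, the llama-cli and llama-server commands can be called the same way as with a brew or winget install.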

Once installed, you can use llama-cli or llama-server as follows:

llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0

Note: you can explicitly add -no-cnv to run the CLI in raw completion mode (non-chat mode).

Additionally, you can use the llama.cpp server to expose an OpenAI-compatible chat completions endpoint:

llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
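The server has to download and load the model before it can answer requests. As a sketch, a script can poll the server's /health endpoint until it reports success. The helper name wait_for_server, the retry count, and the one-second interval are assumptions made for illustration; the address is the default localhost:8080.

```shell
# wait_for_server is a hypothetical helper, not part of llama.cpp.
# It polls /health until the server answers successfully, or gives up
# after the given number of one-second retries.
wait_for_server() {
  base_url="${1:-http://localhost:8080}"   # default llama-server address
  retries="${2:-60}"
  while [ "$retries" -gt 0 ]; do
    # -f makes curl fail on HTTP errors, e.g. 503 while the model loads
    if curl -sf "$base_url/health" >/dev/null 2>&1; then
      return 0
    fi
    retries=$((retries - 1))
    sleep 1
  done
  return 1
}

# Usage, with llama-server started in another terminal:
#   wait_for_server http://localhost:8080 120 && echo "server is ready"
```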

Once the server is running, you can call the endpoint as follows:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"messages": [
    {
        "role": "system",
        "content": "You are an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
    },
    {
        "role": "user",
        "content": "Write a limerick about Python exceptions"
    }
  ]
}'
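The endpoint returns a JSON document in the OpenAI chat completions format, with the generated text under choices[0].message.content. The sketch below assumes jq is installed; extract_reply is a hypothetical helper name used only for illustration.

```shell
# extract_reply is a hypothetical helper: it reads a chat completion
# response on stdin and prints only the assistant's text.
# Assumes jq is installed.
extract_reply() {
  jq -r '.choices[0].message.content'
}

# Usage (requires a running llama-server on localhost:8080):
#   curl -s http://localhost:8080/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"messages": [{"role": "user", "content": "Hello"}]}' | extract_reply
```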

Replace the value passed to -hf with any valid Hugging Face Hub repo name and off you go! 🦙
