Provider matrix
Five providers ship first-class in v1. Each is a class with a small
contract (build_env() / start_cmd() / health() / infer()) that makes
them stateless and swappable. The slot lifecycle is provider-agnostic;
what changes between providers is the workload they serve and the
hardware they target.
The matrix
Section titled “The matrix”| Provider | Hardware | What it serves |
|---|---|---|
| llama.cpp | Vulkan (default) / ROCm (opt-in) | chat, embed, rerank, vision |
| FLM | AMD XDNA NPU (opt-in) | chat / embed / ASR multiplex |
| Moonshine | CPU | STT (/v1/audio/transcriptions) |
| Kokoro | CPU / Vulkan | TTS (/v1/audio/speech) |
| ComfyUI | ROCm (Strix Halo iGPU class) | Image gen (/v1/images/generations) |
All five are first-class in v1. FLM is opt-in because XDNA NPU support depends on AMD’s driver stack being present and a local FLM toolbox image; the picker only advertises NPU when both are detected.
llama.cpp
Section titled “llama.cpp”The default for primary and embed. Handles:
- Chat completions (
/v1/chat/completions). - Plain completions (
/v1/completions). - Embeddings (
/v1/embeddings). - Rerank (
/v1/rerankings, same backend process). - Vision (multimodal models, where the GGUF supports them).
Backend modes:
- Vulkan, the default. Runs on iGPUs (Strix Halo, RDNA3),
discrete AMD, and discrete NVIDIA cards via Vulkan. Toolbox image:
ghcr.io/hal0ai/hal0-toolbox-vulkan(pinned by sha256 inmanifest.json). - ROCm, opt-in via
ghcr.io/hal0ai/hal0-toolbox-rocm(also pinned by sha256). Faster on RDNA3 discrete cards and on Strix Halo’s iGPU where Vulkan leaves performance on the table.
The CUDA path on NVIDIA uses CUDA-backed llama.cpp through the same provider.
For AMD XDNA NPUs (the second AI engine on Strix Halo and newer Ryzen AI parts). Multiplexes chat, embed, and ASR workloads on the NPU, keeping the iGPU free for other slots.
Toolbox image: ghcr.io/hal0ai/hal0-toolbox-flm:v1. The image bundles
FLM at /opt/fastflowlm/, so no host bind-mount of the FLM tree is
required; the container’s ENTRYPOINT runs the in-image flm via tini.
Default port 8086. FLM has its own model namespace (you can’t run
arbitrary GGUFs through it); available tags come from
flm list -j against the bundled model_list.json.
Moonshine
Section titled “Moonshine”The STT provider. Targets edge-real-time speech-to-text: small model, low latency, designed for streaming.
CPU only. Upstream useful-moonshine-onnx ships an ONNX Runtime CPU
EP only; there’s no Vulkan/ROCm EP in the wheel, so the catalog pins
the moonshine runtime fan-out to ("cpu",) and the picker never
advertises a backend the slot can’t honour.
Toolbox image: ghcr.io/hal0ai/hal0-toolbox-moonshine (pinned by
sha256). See Audio for the endpoint shape.
Kokoro
Section titled “Kokoro”The TTS provider. Defaults to Kokoro-82M v1.0 (8 languages,
54 voices), with support for swapping to F5-TTS for voice cloning.
Toolbox image: ghcr.io/hal0ai/hal0-toolbox-kokoro (pinned by
sha256).
ComfyUI
Section titled “ComfyUI”The image-gen provider. Backs /v1/images/generations with curated
SDXL Turbo, SD 1.5, and Flux Schnell weights. hal0 owns the OpenAI ↔
ComfyUI translation; the upstream is treated as a black box that
speaks POST /prompt, GET /history/<id>, GET /view.
Toolbox image: ghcr.io/hal0ai/hal0-toolbox-comfyui:v1 (pinned by
sha256). Targets ROCm: the Strix Halo iGPU class is the v1
first-class target, the unified memory pool holds an SDXL-Turbo
checkpoint alongside a primary chat model.
How a provider plugs in
Section titled “How a provider plugs in”Every provider implements:
| Method | What it does |
|---|---|
build_env() | Compute the env file the systemd unit will consume. |
start_cmd() | The argv to run inside the toolbox image. |
health() | Cheap probe to decide warming → ready. |
infer() | The request path the dispatcher proxies to. |
The slot lifecycle (offline → pulling → starting → warming → ready → serving ↔ idle → unloading) is identical across providers. Adding
a new provider means implementing this contract; no slot-manager
changes required.