Name: hal0
Author: hal0ai

Why hal0

Built like a real homelab service.

Strix Halo native

UMA-aware probe, FLM provider for the XDNA NPU, unified-memory slot-fit warnings sized to the real pool rather than the BAR carve-out. The 128 GB Ryzen AI Max+ 395 is the reference deployment. Every perf number on this page was measured there.

Concurrent chat + embed + voice

Five built-in slot classes (chat, embed, STT, TTS, image), each a real systemd-managed process on its own port. Run them at once. Primary + embed concurrent on Strix Halo measures ~258 tok/s with <200 ms dispatch.

Image gen, day one

POST /v1/images/generations is served by ComfyUI on ROCm, as a real slot on the same lifecycle as chat and everything else. Curated SDXL Turbo, SD 1.5, and Flux Schnell models ship pre-pinned by sha256.

Dispatcher with single-flight

Registry-aware routing across local slots and external upstreams (OpenRouter, Anthropic, OpenAI). Cold-cache prefetch with request coalescing folds a thundering herd of identical prefetches into one HTTP call. Every routing decision logged as structured breadcrumbs.
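The coalescing idea can be sketched in a few lines of shell. This is an illustration of the single-flight technique only, not hal0's dispatcher: `coalesce`, `COALESCE_DIR`, and the lock-directory layout are all hypothetical names, and error handling (a failed leader, timeouts, lock cleanup) is left out.

```sh
# coalesce KEY CMD [ARGS...]
# The first caller for KEY runs CMD; concurrent callers wait and reuse its output.
coalesce() (
  key=$1; shift
  dir="${COALESCE_DIR:-${TMPDIR:-/tmp}/coalesce}"   # hypothetical scratch location
  out="$dir/$key.out"
  mkdir -p "$dir"
  if mkdir "$dir/$key.lock" 2>/dev/null; then
    # mkdir is atomic, so exactly one caller becomes the leader.
    # The result is published with a rename so followers never read a partial file.
    "$@" > "$out.tmp" && mv "$out.tmp" "$out"
  else
    # Followers poll until the leader publishes the result.
    while [ ! -f "$out" ]; do sleep 0.1; done
  fi
  cat "$out"
)
```

Five callers that ask for the same key at the same moment trigger one run of the wrapped command; the other four read the published result.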

Runs in your rack

A tenant on your Proxmox node that admits it.

hal0 is happy in a privileged LXC with iGPU and XDNA passthrough (the recipe is in the docs). Drop a read-only PVE API token in Settings and the dashboard's unified-memory bar shows the physical DIMM total, your other tenants, ZFS ARC, and the free pool. The token is stored with mode 0600 at /etc/hal0/proxmox.json, and the API redacts it on read.

[Screenshot: hal0 unified-memory bar showing GTT inference, system RAM, a muted Proxmox host segment, and free unified memory.]

Dashboard memory bar. The Proxmox host segment is the other tenants, ZFS ARC, and kernel memory pressure competing for the same physical RAM pool.

Lives in an LXC

Privileged LXC + apparmor unconfined + dev0–dev3 passthrough gets you a full iGPU and XDNA NPU without burning a whole VM's worth of RAM. Bare-metal works too.

Behind your Traefik

Single port (:8080) for the API, :3001 for OpenWebUI. Add a vhost on your Traefik or Caddy and you're done. Or run --auth=basic for a bundled Caddy with HTTPS and basic auth.

NFS-friendly model store

/var/lib/hal0/models is just a directory. Mount it from your NAS, point a fresh install at it with --models-dir=PATH, and switch backends without re-pulling.

systemd all the way

Slots are hal0-slot@<name>.service instances on a template unit. systemctl status hal0-* tells you what's actually running. Not an Electron app, not a daemon you have to nurse.
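A small helper makes the template-unit naming concrete. This is a sketch under assumptions: `slot_unit` and `slot_report` are hypothetical helper names, and `chat` and `embed` are assumed instance names (`img` is the only slot name this page states).

```sh
# Map a slot name to its template-unit instance.
slot_unit() {
  printf 'hal0-slot@%s.service\n' "$1"
}

# Print "<slot><TAB><state>" for each slot name given.
slot_report() {
  for _slot in "$@"; do
    # is-active exits non-zero for anything but "active"; keep going regardless.
    _state=$(systemctl is-active "$(slot_unit "$_slot")" 2>/dev/null) || true
    printf '%s\t%s\n' "$_slot" "${_state:-unknown}"
  done
}

# Usage (on a box with hal0 installed):
#   slot_report chat embed img
```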

What ships in v1

The whole local AI stack, one install.

Five-provider stack

llama.cpp (Vulkan / ROCm / CUDA) for chat, embed, and rerank. FLM for the XDNA NPU. Moonshine for STT. Kokoro for TTS. ComfyUI for image gen. All wrapped in the same slot lifecycle.

OpenAI /v1/* surface

Chat, completions, embeddings, rerank, audio transcriptions, audio speech, images, models. Same shapes any OpenAI SDK already speaks. Point your client at localhost:8080 and go.
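As one example beyond chat, an embeddings call might look like the sketch below. It assumes the standard OpenAI request shape on /v1/embeddings; `HAL0_URL`, `embed_payload`, and `embed` are hypothetical names, and the exact model id your install exposes may differ from the GGUF name used in the usage line.

```sh
HAL0_URL="${HAL0_URL:-http://localhost:8080}"   # hypothetical override variable

# Build the request body.
# $1 = model id, $2 = input text. No JSON escaping is done here, so the
# text must not contain double quotes or backslashes.
embed_payload() {
  printf '{"model":"%s","input":"%s"}' "$1" "$2"
}

# POST the body to the embeddings endpoint.
embed() {
  curl -fsS "$HAL0_URL/v1/embeddings" \
    -H "Content-Type: application/json" \
    -d "$(embed_payload "$1" "$2")"
}

# Usage:
#   embed nomic-embed-text-v2-moe-Q4_K_M "hello world"
```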

Slot state machine

Every workload has typed states (offline → pulling → warming → ready → serving ↔ idle → unloading) with atomic transitions, persisted to state.json and streamed over SSE.
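The arrows above can be written down as a transition table. The sketch below is illustrative rather than hal0's implementation: it allows only the transitions listed in the text (error paths and the return from unloading are not described on this page, so they are left out), and the one-key JSON written to the state file is a hypothetical shape, not the real state.json schema.

```sh
# Succeeds (exit 0) only for the arrows listed in the text above.
can_transition() {
  case "$1:$2" in
    offline:pulling | pulling:warming | warming:ready | ready:serving) return 0 ;;
    serving:idle | idle:serving | idle:unloading) return 0 ;;
    *) return 1 ;;
  esac
}

# Persist a state atomically: write a temp file, then rename it over the target.
persist_state() {
  _tmp="$1.tmp.$$"
  printf '{"state":"%s"}\n' "$2" > "$_tmp" && mv "$_tmp" "$1"
}

# Apply a transition only if it is legal.
# $1 = state file, $2 = current state, $3 = requested state
transition() {
  if can_transition "$2" "$3"; then
    persist_state "$1" "$3"
  else
    echo "illegal transition: $2 -> $3" >&2
    return 1
  fi
}
```

Because the rename is a single filesystem operation, a reader of the state file sees either the old state or the new one, never a half-written file.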

Capability cards

Dashboard groups slots into Embed, Voice, and Img cards plus an NPU backend rollup. Pick a model and the orchestrator validates the (backend, model) pair against the catalog and reconciles the underlying slot on every apply. State lives in /etc/hal0/capabilities.toml.

Prewired OpenWebUI

Chat UI on :3001 with zero config. The installer points it at the local hal0 API. The dashboard is for operating the box, not chatting.

Dashboard (Vue 3)

Nine views, including Slots, Models, Hardware, Logs, Settings, Providers, First-run, and an error shell. Dark by default. SSE-backed live status and log tail.

Auth + HTTPS, one flag

Off by default for trusted-LAN installs. --auth=basic brings up Caddy with basic_auth at the edge, bearer tokens for the OpenAI API, and automatic HTTPS (internal CA for .local, Let's Encrypt for real domains). No certbot ritual.

Atomic self-update

hal0 update --channel stable|nightly. Cosign-verified tarballs swap a current symlink; --rollback reverts. Slot units survive API restarts.
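The symlink swap is a common pattern and can be sketched generically. This is not hal0's updater: the `releases/` layout, the `activate` helper, and the version numbers are hypothetical, signature verification is omitted, and `mv -T` requires GNU coreutils.

```sh
# activate ROOT RELEASE: point ROOT/current at ROOT/releases/RELEASE in one rename.
activate() {
  _root=$1
  _rel=$2
  [ -d "$_root/releases/$_rel" ] || { echo "no such release: $_rel" >&2; return 1; }
  # Build the new link under a temporary name, then rename it over "current".
  # rename(2) replaces the old link atomically, so "current" always resolves.
  ln -s "releases/$_rel" "$_root/current.new.$$" &&
    mv -T "$_root/current.new.$$" "$_root/current"
}

# Usage:
#   activate /opt/example 1.0.1    # upgrade
#   activate /opt/example 1.0.0    # roll back: the same swap, pointed at the old release
```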

One-line install

Linux + systemd, idempotent, non-interactive. Pre-flight checks, systemd template units, working slot defaults dropped on disk. No manual yaml.

See it running

One dashboard, every slot, live.

KPIs across the top, the unified-memory bar in the middle (with the Proxmox host segment when a PVE token is configured), the slot grid below with per-slot T/S, ACT, MEM, and uptime. SSE keeps the numbers honest; systemctl is-active is the floor, not the ceiling.

[Screenshot: hal0 dashboard at root path showing API and slot KPIs, the unified-memory bar with a Proxmox host segment, and the slot grid below.]

Dashboard at `/`. The memory bar accounts for the LXC's GTT pool, the host's other PVE tenants, and free unified memory.

Hardware

Strix Halo is the reference. The rest of the matrix is honest.

Linux + systemd is the only hard requirement (installer/install.sh:86). The probe picks the right provider; you don't pin a backend by hand.

Provider matrix — picked automatically by the hardware probe

Hardware	Vendor	Unified / VRAM	Support	Notes
AMD Ryzen AI Max+ 395	AMD · "Strix Halo"	Unified 128 GB	first-class	Reference deployment. iGPU + XDNA NPU + UMA-aware probe. Vulkan default; FLM for NPU.
AMD Ryzen AI Max 385 / 390	AMD · "Strix Halo"	Unified 64 GB	first-class	Same path as the 128 GB SKU; small + mid tiers fit comfortably, 70B Q4 with shorter context.
NVIDIA RTX 30/40/50	NVIDIA	10–32 GB	supported	CUDA-backed llama.cpp. Same slot lifecycle, dedicated VRAM instead of UMA.
AMD Radeon RX 7000 / discrete	AMD	16–24 GB	supported	Vulkan path today; ROCm toolbox image on the build list for opt-in.
CPU-only x86_64	Intel / AMD	System RAM	experimental	Vulkan-CPU fallback. Small models only. CI runs Qwen 0.5B here. Not the headline experience.

macOS and Windows are not in scope for v1. The FLM NPU provider is live with a self-contained ghcr.io/hal0ai/hal0-toolbox-flm:v1 image; NPU benchmarks land once the manifest digest pin lands in hal0/manifest.json.

Recommended loadouts

Three starting points. Mix, match, swap.

Curated picks for the 128 GB Strix Halo — refreshed to the latest open-weight releases as of May 2026. The slot system takes a different model per slot whenever you change your mind. See the full hardware page for discrete-GPU and CPU loadouts.

Coding · mid

~19 GB

primary · Qwen3-Coder-30B-A3B-Instruct-Q4_K_M
embed · nomic-embed-text-v2-moe-Q4_K_M

MoE with 3B active params. Runs near 3B speeds, reasons like a 30B. Pairs with a 140 MB embed for repo-aware search.

Voice mode

~3 GB

primary · Qwen3-4B-Instruct-2507-Q4_K_M
stt · Moonshine base
tts · Kokoro-82M v1.0

Low-latency reply, edge-built STT, 54-voice TTS. 128 GB leaves the entire rest of the budget free for a second chat model warm in another slot.

Maxed-out · long context

~50 GB

primary · Llama-4-Scout-17B-16E-Instruct-Q4_K_M
embed · bge-m3 (8192-token context)

10M-token context, MoE with 17B active. The biggest realistic single-model loadout that still leaves room for STT/TTS slots warm on 128 GB unified.

Loadouts are starting points. Every real install ends up tweaked. Sizes are published GGUF file sizes (Hugging Face, May 2026); no tok/s numbers are quoted for these loadouts.
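To see how the three loadouts stack up against the 128 GB pool, the arithmetic can be sketched as below. `fit_check` is a hypothetical helper, not hal0's fit logic: it only sums whole-GB file sizes, while real fit warnings also have to account for KV cache and runtime overhead, which GGUF file sizes do not include.

```sh
# fit_check POOL_GB SIZE_GB...: sum the model sizes and compare against the pool.
fit_check() {
  _pool=$1
  shift
  _total=0
  for _size in "$@"; do
    _total=$((_total + _size))
  done
  if [ "$_total" -le "$_pool" ]; then
    echo "fits: $_total of $_pool GB, $((_pool - _total)) GB free"
  else
    echo "over budget: $_total of $_pool GB, $((_total - _pool)) GB short"
    return 1
  fi
}

# All three loadouts at once on the 128 GB reference box:
#   fit_check 128 19 3 50
```

By file size alone, the three loadouts together take about 72 GB and leave roughly 56 GB of the pool for context and other slots.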

Quickstart

From zero to a live `/v1/chat` in two commands.

The installer is idempotent and non-interactive. It probes the hardware, writes /etc/hal0/hardware.json, drops working slot defaults, and brings the API up on :8080.

1 · install

install.sh

# install on any modern Linux box with systemd
curl -fsSL https://hal0.dev/install.sh | bash

# optional overrides (set them on bash, not curl, so the installer sees them)
# curl … | HAL0_PORT=9090 HAL0_PREFIX=/opt/hal0 bash

2 · chat

/v1/chat/completions

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-0.5b-instruct-q4_k_m",
    "messages": [{"role":"user","content":"Hello!"}]
  }'

Prefer a chat UI? OpenWebUI ships prewired on :3001, pointed at the local hal0 API out of the box.

Comparison

Where hal0 sits next to ollama, LM Studio, and raw llama.cpp.

hal0 isn't an inference engine. It's the orchestration, lifecycle, and multi-modal surface around llama.cpp, FLM, Moonshine, Kokoro, and ComfyUI. Honest take, in one table.

Feature	hal0	ollama	LM Studio	raw llama.cpp	Cloud API
OpenAI /v1/* surface	chat, embed, rerank, STT, TTS	chat-only subset	chat-only	raw HTTP	full
systemd-managed lifecycle	✓	partial	desktop app	DIY	n/a
Hardware probe + fit warnings	✓	—	—	—	n/a
Headless one-line install	✓	✓	GUI installer	manual	n/a
Multi-model concurrent slots	✓	partial	—	DIY	✓
Bundled chat UI	OpenWebUI prewired	—	built-in	—	varies
Signed self-update + rollback	cosign	manual	desktop updater	manual	n/a
Data stays on your box	✓	✓	✓	✓	—

vs. ollama. systemd-managed slots survive hal0-api restarts. The OpenAI surface includes embeddings, rerank, and STT/TTS, not only chat. Hardware probe and slot fit warnings are first-class.

vs. LM Studio. Linux-first, headless-first, one-line install, no GUI required. Prewired OpenWebUI handles chat; the dashboard is for operating the box.

vs. raw llama.cpp. hal0 owns the lifecycle: health probes, atomic env writes, cold-boot grace, single-flight prefetch, structured errors, signed self-update with rollback.

vs. cloud APIs. Your hardware, your data, your models. External upstreams (OpenRouter, Anthropic, OpenAI, custom) can be configured as fallbacks behind the same /v1/* surface, so you can mix local and remote per-model in one config.

Roadmap

v1 shipped. v0.2 is where it gets interesting.

Phase 1 landed on 2026-05-15 with 353 unit tests passing and the integration tier on Vulkan-CPU + Qwen 0.5B. Here's what's next.

Full roadmap →

Now

v1

Capability slots overlay

shipped

v1

Embed / Voice / Img cards + NPU backend rollup, backed by /etc/hal0/capabilities.toml and GET|POST /api/capabilities/*. Drift between capabilities.toml ↔ slots/*.toml reconciles on every apply.

FLM NPU live

shipped

v1

Self-contained ghcr.io/hal0ai/hal0-toolbox-flm:v1 image. Chat + embed surfaced in the picker when XDNA and the toolbox image are both present; model namespace from flm list -j.

Image generation

shipped

v1

POST /v1/images/generations served by ComfyUI on ROCm. Curated SDXL Turbo / SD 1.5 / Flux Schnell, slot named img.

First-run wizard

shipped

v1

Eight linear steps from password through hardware, capabilities, conditional HF token, license, install, done. Legacy 5-step picker and three IA prototypes deleted.

Packaging

v0.2

AUR + Ubuntu PPA

Benchmarks + Presets UI

Trust

Boring guarantees, in writing.

License

Apache-2.0

Patent grant included. Bundle, fork, ship.

Telemetry

Off by default

Opt-in surfaces hw class, version, slot count. No model names. No IPs. No config.

Releases

cosign-signed

hal0 update verifies GitHub-OIDC signer identity before unpacking.

Source

github.com/hal0ai/hal0 →

Issues, discussions, release manifests. Pre-alpha; see CONTRIBUTING for contribution status.

Community

Homelab AI inference platform for tinkerers and devs.

Hugging Face model pulls

Memory subsystem

MCP support

Built in the open. Talk to us in the open.

GitHub Issues →

GitHub Discussions →

hello@hal0.dev →

Install hal0 in under a minute.

From zero to a live `/v1/chat` in two commands.