roadmap

One service for the whole local AI stack.

v0.1.0-alpha ships the OpenAI-compatible inference layer: slots, dispatcher, hardware probe, bundled chat, signed self-update, Proxmox host-pressure segment. v0.2 takes the same lifecycle discipline and extends it into the rest of the local AI stack: agents, persistent memory, MCP, tools, image, voice, third-party extensions. The goal is one place on your hardware that runs the whole stack, instead of seven daemons glued together.

No dates below. Items are direction. The closer a thing is to the left column, the closer it is to running on your box.

Shipped

v1

OpenAI-compatible /v1/* API

shipped
Chat, completions, embeddings, rerank, transcriptions, speech, images. Every OpenAI SDK works unchanged against the local box.

Slot lifecycle state machine

shipped
Real atomic transitions (offline → pulling → starting → warming → ready → serving), persisted and SSE-streamed so the dashboard reflects reality, not snapshots.

Hardware-aware probe

shipped
Detects GPU, NPU, and unified memory; surfaces VRAM/RAM fit warnings inline before you load a model that won’t fit.

Bundled OpenWebUI

shipped
Prewired chat on :3001 pointed at the local API. Zero config; the installer wires it for you.

Cosign-signed self-update

shipped
Atomic version swap with rollback. Stable and nightly channels, GitHub OIDC-verified release tarballs.

Five-provider stack

shipped
llama.cpp (Vulkan / ROCm) for chat and embed, FLM for NPU multiplex, Moonshine for STT, Kokoro for TTS, ComfyUI for image generation. All first-class in v1.

Image generation

shipped
POST /v1/images/generations served by a ComfyUI provider on ROCm. Curated SDXL Turbo / SD 1.5 / Flux Schnell with license badges; the img slot runs inside the same lifecycle as chat, embed, and audio.

Caddy + auth + HTTPS + mDNS

shipped
install.sh --auth=basic brings up Caddy at the edge with basic_auth, Bearer tokens for the OpenAI API, automatic HTTPS (internal CA for .local, Let’s Encrypt for real domains), and a best-effort Avahi service file so hal0.local resolves on the LAN. Off by default; opt in when you’re ready to expose the box.

Installer overhaul

shipped
ASCII banner + step counter, extracted preflight.sh (disk + ports + python + systemd), hal0 doctor for re-running it post-install, hardware cards + auto-populated slots/primary.toml, contextual ERR-trap recovery hints, and a live-hello + QR + reachability finish.

Capability slots overlay

shipped
Embed, Voice, and Img capability cards plus an NPU backend rollup in the dashboard, backed by /etc/hal0/capabilities.toml and GET|POST /api/capabilities/*. The orchestrator picks one model per child, validates (backend, model) pairs against the catalog, and reconciles drift on every apply.

FLM NPU provider live

shipped
Self-contained ghcr.io/hal0ai/hal0-toolbox-flm:v1 image bundles FLM; chat and embed surfaced in the picker only when XDNA hardware and the local toolbox image are both present. Model namespace comes from flm list -j; STT slice deferred.

Proxmox host-pressure segment

shipped
Drop a read-only PVEAuditor API token into Settings → Proxmox integration and the dashboard memory bar shows the physical DIMM total plus a muted Proxmox host segment for other-tenant + ZFS ARC + kernel pressure. Token sits 0600 at /etc/hal0/proxmox.json and the API redacts it on read; bare-metal installs leave the panel off.

embed-rerank built-in slot

shipped
The capability orchestrator auto-creates an embed-rerank slot on first enable: bge-reranker-v2-m3-q4_k_m on port 8086 with --reranking, so /v1/rerankings stays separate from the chat surface on :8081.

First-run wizard rewrite

shipped
Eight linear steps from password through hardware, primary model, capabilities, HF token, license, install, done. HF-token step is conditional on a gated model selection; legacy 5-step picker and three IA prototypes deleted.

--models-dir install flag

shipped
install.sh --models-dir=PATH (or HAL0_MODELS_DIR, or an interactive tty prompt) points a fresh install at an existing model store. Persisted as [models].pull_root and included in [models].roots so the on-disk tree scans on first boot.

Models scan preview overhaul

shipped
POST /api/models/scan/preview is inspection-only; kind-driven gating (llama / moonshine / kokoro / flm), shared skip rules, per-row select, dedup across HF blob-cache symlinks, GGUF magic-byte detection regardless of extension, GGUF general.name → suggested name, HF-cache repo-name fallback.

Per-slot live metrics

shipped
GET /api/slots/metrics reads docker cgroup memory + ActiveEnterTimestampMonotonic + scraped llama-server /metrics, with a fallback to the systemd unit’s MemoryCurrent for native-host slots.

Orchestrator drift reconcile

shipped
CapabilityOrchestrator.apply() now reconciles capabilities.toml ↔ slots/*.toml on every call rather than only on selection diff, so prior failed applies, manual edits, and install-time seeds can no longer leave the two disagreeing.

Soon

v0.2

Hugging Face model pulls

in-progress
POST /api/models/{id}/pull (streaming HF download) is still 501 today; the picker + first-run wizard wire it for the next 0.1.x alpha. (FLM-aware pull for NPU tags landed in PR #89, so Ollama-style family:size ids already pull from the dashboard.)

Agent management & integration

in-progress
First-class agents stored on hal0 with conversation orchestration and deterministic tool-use loops. A place for agents to live on your box, not only to chat with.

Unified memory system

in-progress
Persistent, federated memory across local slots and (optionally) external models. Mem0-style API with sources, search, and scoping.

MCP support, host and server

in-progress
hal0 speaks Model Context Protocol both directions. Compose tools across local slots and external MCP services; discover them from the dashboard.

Extensions framework

in-progress
Third-party apps packaged as hal0 extensions: systemd unit + dashboard tile + healthcheck. Bring-your-own-app, lifecycle-managed.

Benchmarks & presets UI

in-progress
In-dashboard tok/s + latency runs, plus curated loadout presets you can flash onto a fresh install.

AUR PKGBUILD & Ubuntu PPA

in-progress
Native distro packages on top of the install script: pacman and apt as first-class install paths.

Exploring

post-v1.0

Multi-host federation

later
A slot mesh across LAN boxes: primary on the Strix Halo, embed on the workstation, all behind one /v1/* surface.

Fine-tune & LoRA hot-swap

later
Attach and rotate LoRAs against a warm base model without unloading the underlying weights.

Per-model rate limits & budgets

later
Cost-style accounting for local inference, so a chatty agent can be capped without taking the whole box down.

Voice mode end-to-end

later
Moonshine + agent loop + Kokoro stitched into a hands-free streaming conversation, gated through the slot lifecycle.

ChatOps adapters

later
Slack and Matrix bridges as extensions, so you can talk to hal0 from the rooms you already live in.

Exploring items are bets we believe in, not promises. Scope and order will shift as v0.2 lands and we learn what people actually do with a local platform.

shape it

The roadmap is a conversation.

File issues, open discussions, send patches. hal0 is Apache-2.0; the platform is yours to extend, fork, or steer. The agents / memory / MCP track in particular benefits from real-world use cases more than from our guesses.