models-config.json. That endpoint is your “backend LLM” or “model server.”
This page is intentionally short. Picking, sizing, and operating an inference engine is its own discipline; we link to the canonical references rather than maintaining our own.
Common backends
llama.cpp / llama-server
Single-binary CPU/GPU inference. Bundled in our local-only demo.
vLLM
Production-grade GPU serving with continuous batching.
Ollama
Easy local model server,
OpenAI-compatible.Hosted reseller
Front Venice / OpenAI / Anthropic via
apiUrl + apiKey in models-config.json. See Resale provider.What the proxy-router needs from your backend
- OpenAI-compatible route appropriate for the model type:
- LLM:
/v1/chat/completions - Embeddings:
/v1/embeddings - STT:
/v1/audio/transcriptions - TTS:
/v1/audio/speech
- LLM:
- A stable, private URL the proxy-router can reach (e.g.
http://10.0.0.5:8080/v1/chat/completions). - Enough concurrency to satisfy the
concurrentSlotsyou advertise inmodels-config.json.
Capacity recommendations
There is no one-size-fits-all sizing — start by measuring on your own hardware. Thetech.mor.org calculators help estimate revenue and tokens-per-second across hardware tiers; mirror summary at tech.mor.org (mirror).
TEE backends
For full Phase 2 attestation, the backend itself must run inside a SecretVM-style TEE that exposes attestation endpoints on:29343 (/cpu, /gpu, /docker-compose). See:
- TEE overview
- TEE reference
- Backend-side developer notes:
proxy-router/docs/tee-backend-verification.md

