The Morpheus proxy-router does not run inference itself. It forwards prompts to whatever OpenAI-compatible HTTP endpoint you point it at via models-config.json. That endpoint is your “backend LLM” or “model server.”
This page is intentionally short. Picking, sizing, and operating an inference engine is its own discipline; we link to the canonical references rather than maintaining our own.

Common backends

llama.cpp / llama-server

Single-binary CPU/GPU inference. Bundled in our local-only demo.

vLLM

Production-grade GPU serving with continuous batching.

Ollama

Easy local model server, OpenAI-compatible.

Hosted reseller

Front Venice / OpenAI / Anthropic via apiUrl + apiKey in models-config.json. See Resale provider.

What the proxy-router needs from your backend

  • OpenAI-compatible route appropriate for the model type:
    • LLM: /v1/chat/completions
    • Embeddings: /v1/embeddings
    • STT: /v1/audio/transcriptions
    • TTS: /v1/audio/speech
  • A stable, private URL the proxy-router can reach (e.g. http://10.0.0.5:8080/v1/chat/completions).
  • Enough concurrency to satisfy the concurrentSlots you advertise in models-config.json.

Capacity recommendations

There is no one-size-fits-all sizing — start by measuring on your own hardware. The tech.mor.org calculators help estimate revenue and tokens-per-second across hardware tiers; mirror summary at tech.mor.org (mirror).

TEE backends

For full Phase 2 attestation, the backend itself must run inside a SecretVM-style TEE that exposes attestation endpoints on :29343 (/cpu, /gpu, /docker-compose). See: