Models

Osaurus is model-agnostic. Run a fast 2B local model on the train, switch to GPT-4o at the office, hand off to Apple's on-device Foundation model on the weekend — your agents, memory, and tools stay intact across all of them.

What you can run

Source | Where it runs | macOS | Setup
MLX (local) | On your Mac, Apple Silicon | 15.5+ | Download once via Model Manager
Apple Foundation | On your Mac, Apple Neural Engine | 26+ | Zero — model name is just foundation
Liquid Foundation | On your Mac | 15.5+ | Download via Model Manager
Cloud providers | Their servers | 15.5+ | API key in Management → Providers

Local models (MLX)

MLX is Apple's array framework with first-class GPU support via unified memory. Local models on Osaurus run through MLX with optimizations for Apple Silicon.

Downloading

  1. Open the Management window (⌘ ⇧ M) → Models
  2. Browse or search the catalog
  3. Click Download on a model
  4. Watch progress in the queue

Each entry shows name, parameter count, quantization (4-bit / 8-bit / JANGTQ / mxfp4), and total disk size.

Where models live

By default, models live at ~/MLXModels/. To put them on an external drive (helpful for big models), set OSU_MODELS_DIR:

export OSU_MODELS_DIR=/Volumes/External/MLXModels
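
For example, to relocate downloads to an external drive, you might move the existing folder and point the variable at the new location. A minimal sketch; it assumes Osaurus re-scans the directory named by OSU_MODELS_DIR on its next launch:

# Create the target folder on the external drive (hypothetical path)
mkdir -p /Volumes/External/MLXModels
# Move any models already downloaded to the default location
mv ~/MLXModels/* /Volumes/External/MLXModels/
# Point Osaurus at the new location
export OSU_MODELS_DIR=/Volumes/External/MLXModels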

To remove a model: Models → Downloaded → Delete.

Curated lineup on Hugging Face

Osaurus maintains its own optimized model library on Hugging Face. Downloads from the in-app Model Manager pull from this library by default. Highlights:

Small / fast — start here

API name | Params | Size | Notes
gemma-4-e2b-it-4bit | 2B | ~1.5 GB | Recommended first model. Tool calling out of the box.
gemma-4-e2b-it-8bit | 2B | ~2.5 GB | Same model, higher quality
gemma-4-e4b-it-4bit | 4B | ~2.6 GB | Step up in capability
gemma-4-e4b-it-8bit | 4B | ~4.2 GB |
laguna-xs.2-jangtq | 3B | ~1.6 GB | OsaurusAI's tiny JANGTQ-quantized model

Mid-range

API name | Params | Active | Size | Notes
gemma-4-26b-a4b-it-4bit | 26B MoE | 4B | ~13 GB | Solid coding + reasoning
gemma-4-26b-a4b-it-jang_4m | 26B MoE | 4B | smaller | OsaurusAI JANGTQ variant
gemma-4-31b-it-jang_4m | 31B | dense | |
qwen3.5-35b-a3b-jang_2s | 35B MoE | 3B | smallest |
qwen3.5-35b-a3b-jang_4k | 35B MoE | 3B | |
qwen3.6-27b-jang_4m | 27B | dense | |

Vision-capable (image input + text output)

API name | Params | Active | Size | Notes
mistral-medium-3.5-128b-jangtq | 128B | dense | | Mistral's flagship vision model
mistral-medium-3.5-128b-mxfp4 | 128B | dense | | mxfp4 variant
nemotron-3-nano-omni-30b-a3b-jangtq2 | 30B MoE | 3B | | NVIDIA Nemotron Omni
nemotron-3-nano-omni-30b-a3b-jangtq4 | 30B MoE | 6B | |
holo3-35b-a3b-jangtq2 | 35B MoE | 3B | | Holo3 vision model
qwen3.6-35b-a3b-jangtq2 | 35B MoE | 3B | | Qwen vision

Large frontier-class

API name | Params | Active | Notes
qwen3.5-122b-a10b-jang_2s | 122B MoE | 10B | Largest available
qwen3.5-122b-a10b-jang_4k | 122B MoE | 10B |
minimax-m2.7-jangtq | 15B | dense |
minimax-m2.7-jangtq4 | 29B | dense |
deepseek-v4-flash-jangtq | 21B | dense |

For the canonical, always-up-to-date list, see OsaurusAI on Hugging Face.

About the quantizations

A model's filename hints at how it was compressed (lower bit width = less RAM, higher bit width = higher quality):

Suffix | What it is
4bit / 8bit | Standard MLX integer quantization
mxfp4 | MX FP4 — block floating-point, great quality at a 4-bit footprint
JANGTQ / JANGTQ2 / JANGTQ4 | OsaurusAI's curated quants, tuned for Apple Silicon. Better quality-to-size than off-the-shelf 4-bit.
JANG_2L, JANG_4M, JANG_2S, JANG_4K | Variant codes — different bit widths × calibration recipes

Rule of thumb: for a model with multiple variants, try the one ending in JANGTQ4 first (best quality for the size), then drop to JANGTQ2 or JANG_2S if you need less RAM.

Tool calling

Tool calling works across every family above. Osaurus's tool-call parser handles JSON, Qwen XML, Mistral, GLM-4, LFM2, Kimi K2, Gemma 3/4, and MiniMax M2 dialects automatically — your agents don't care which model produced the call.
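
As a minimal sketch, here is an OpenAI-style tool-calling request against the local endpoint. The get_weather tool is made up for illustration; whatever dialect the model emits, the call is expected back in the standard tool_calls field of the response:

# Hypothetical tool definition; any OpenAI-style "tools" array works the same way
curl http://127.0.0.1:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-e2b-it-4bit",
    "messages": [{"role": "user", "content": "What is the weather in Osaka?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": { "city": { "type": "string" } },
          "required": ["city"]
        }
      }
    }]
  }'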

How much RAM does a model need?

On Apple Silicon there is no separate VRAM; the GPU shares unified memory with the rest of the system. Approximate RAM per model:

  • 4-bit: ~0.6 GB per billion parameters
  • 8-bit: ~1.2 GB per billion parameters
  • MoE models: only the active-parameter weights are touched per token, so a 35B/3B-active MoE behaves closer to a 3B model in steady-state memory

So gemma-4-e2b-it-4bit (2B, 4-bit) needs ~1.5 GB, while qwen3.6-35b-a3b-jangtq2 (35B total, 3B active) sits closer to the 3B class in steady-state memory, plus whatever expert weights get swapped in.

Pick a quantization that leaves room for the rest of your work and your Core Model.

Loaded models and eviction

Configure how local models are cached in Settings → Local Inference → Model Management:

Policy | Behavior
Strict (One Model) | Only one local model loaded at a time (default). Switching unloads the previous one.
Flexible (Multi Model) | Multiple models loaded concurrently. Required if your Core Model is local and different from your chat model — otherwise the two will fight over the slot.

Models load on demand when a chat window opens (with prefix caching warm-up) and unload when no chat references them.

Apple Foundation Models

On macOS 26 (Tahoe) or later, you can use Apple's on-device system model with zero configuration and zero downloads.

curl http://127.0.0.1:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "foundation",
    "messages": [{"role":"user","content":"Hello!"}]
  }'

The model name is literally foundation. Tool calling, streaming, and the standard generation parameters all work — Osaurus translates between OpenAI/Anthropic semantics and Apple's native interface automatically. It's also the recommended Core Model for memory and capability auto-selection on macOS 26+.
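
For example, streaming works the same as with any other model. A hedged sketch; the stream is assumed to arrive as standard OpenAI-style server-sent events:

# -N disables curl's output buffering so chunks print as they arrive
curl -N http://127.0.0.1:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "foundation",
    "messages": [{"role":"user","content":"Write a haiku about the Neural Engine."}],
    "stream": true
  }'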

Apple Intelligence guide →

Liquid Foundation Models

Liquid AI's LFM family is built on a non-transformer architecture optimized for edge deployment. Highlights:

  • Fast token generation on Apple Silicon
  • Low memory footprint compared to equivalent-quality transformers
  • Strong tool calling out of the box

Download LFM models the same way as any other MLX model — they appear in the Model Manager catalog.

Cloud providers

Connect to cloud providers when you need more power. Each provider's models appear alongside local models in the model picker; switching is one click.

Provider | Notes
OpenAI | GPT-4o, o-series, etc. via OpenAI Chat Completions
Anthropic | Claude family via Anthropic Messages
Gemini | Google Gemini
xAI / Grok | xAI's Grok via OpenAI-compatible endpoint
Venice AI | Privacy-focused, uncensored, no data retention
OpenRouter | One key, many providers (openai/gpt-4o, anthropic/claude-3.5-sonnet, …)
Ollama | Local or remote Ollama servers
LM Studio | LM Studio's local server

Add a provider via Management → Providers → Add Provider. API keys are stored in the macOS Keychain.

Remote Providers →

Memory and agent context persist across providers — switching from your local Gemma to Claude 4 or GPT-4o doesn't lose your agent's memory.
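
As a sketch, assuming provider-backed models are addressable through the same local endpoint under ids like those shown in the table (the exact id depends on how the provider is configured), switching to a cloud model is just a change of the model field:

curl http://127.0.0.1:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [{"role":"user","content":"Hello from the cloud side."}]
  }'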

Model naming

An API model name is the model's display name, lowercased, with hyphens in place of spaces:

Display name | API name
Gemma 4 E2B it 4bit | gemma-4-e2b-it-4bit
Qwen3.6 35B A3B JANGTQ2 | qwen3.6-35b-a3b-jangtq2
Mistral Medium 3.5 128B JANGTQ | mistral-medium-3.5-128b-jangtq

List models from any client:

curl http://127.0.0.1:1337/v1/models
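
Assuming the response follows the standard OpenAI list shape (one id per entry under data), the names can be pulled out with jq:

# Print just the API model names
curl -s http://127.0.0.1:1337/v1/models | jq -r '.data[].id'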

Per-request settings

Most generation behavior is set per request via the API. Common parameters:

{
  "model": "gemma-4-e2b-it-4bit",
  "messages": [{ "role": "user", "content": "Hello" }],
  "temperature": 0.7,
  "max_tokens": 1000,
  "top_p": 0.9,
  "stream": true
}
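
Sent as a complete request to the local endpoint, that body looks like this:

curl http://127.0.0.1:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-e2b-it-4bit",
    "messages": [{ "role": "user", "content": "Hello" }],
    "temperature": 0.7,
    "max_tokens": 1000,
    "top_p": 0.9,
    "stream": true
  }'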

Recommended temperature ranges:

Use case | Temperature
Code, deterministic tasks | 0.0–0.3
Factual responses | 0.0–0.3
General chat | 0.5–0.7
Creative writing | 0.7–1.0

Full API reference →

Context length

Each model has its own context limit; Osaurus picks sane defaults automatically. Multi-turn caching is also automatic — repeating the same system prompt across messages is cheap. For tunables, see Inference Runtime.

Troubleshooting

"Model not found"

  • Check it's downloaded: Management → Models → Downloaded
  • List API model names: curl http://127.0.0.1:1337/v1/models
  • Match the API name exactly (lowercase, hyphens)

Slow generation

  • Try a smaller or more aggressively quantized variant (e.g. JANGTQ2 instead of JANGTQ4)
  • Close memory-hungry apps
  • Reduce max_tokens
  • Watch Activity Monitor for memory pressure

Download fails

  • Check internet connection and disk space
  • Pause and resume; partial files are kept
  • Try a different mirror via the model card's "Source" link

Out of memory

  • Switch to a more aggressive quantization (JANGTQ2, JANG_2S, or 4-bit instead of 8-bit)
  • Reduce max_tokens
  • Consider a smaller model (drop from MoE-large to gemma-4-e2b-it-4bit)
  • Switch to the Strict (One Model) eviction policy if you have multiple models loaded

Under the hood

Curious about continuous batching, the KV cache, batch size tuning, or how the inference path is structured? See Inference Runtime.

