Model Management
Osaurus supports a wide variety of MLX-optimized models and Apple Foundation Models. This guide covers model selection, management, and optimization.
Model Recommendations
Best Overall Performance
Llama 3.2 3B Instruct (4-bit)
- Excellent balance of quality and speed
- 2GB download size
- Suitable for most tasks
- Performs well on 8GB+ systems
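Once downloaded, you can query it immediately. A minimal sketch, assuming the default port 1337 and a model name following the naming convention described later in this guide:

```bash
# Minimal chat request to the recommended model
curl -X POST http://127.0.0.1:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-instruct-4bit",
    "messages": [{"role": "user", "content": "Summarize MLX in one sentence."}]
  }'
```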
Code Generation
DeepSeek Coder 7B (4-bit)
- Specialized for programming tasks
- Strong multi-language support
- 4GB download size
- Ideal for code reviews and generation
Maximum Speed
Gemma 2 2B Instruct (4-bit)
- Ultra-fast response times
- 1.5GB download size
- Good for simple tasks
- Runs efficiently on all M-series Macs
Highest Quality
Llama 3.2 8B Instruct (4-bit)
- Best quality among models under 10B parameters
- 5GB download size
- More nuanced responses
- Recommended for 16GB+ RAM
Model Manager
Access the Model Manager through the Osaurus menu bar icon.
Downloading Models
- Click the Osaurus menu bar icon
- Select Model Manager
- Browse or search for models
- Click Download on your chosen model
- Monitor progress in the download queue
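Once a download finishes, you can confirm the server sees the model by listing available models (the same endpoint referenced under Troubleshooting below):

```bash
# List all models the server can currently serve
curl http://127.0.0.1:1337/v1/models
```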
Model Information
Each model displays:
- Name — Model identifier
- Size — Download and disk size
- Quantization — Bit precision (4-bit, 8-bit)
- Parameters — Model size in billions
- Download Status — Current state
Managing Storage
Models are stored in:
```
~/Library/Containers/ai.dinoki.osaurus/Data/Library/Application Support/models/
```
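To check how much disk space downloaded models are using, a quick Terminal sketch against the path above:

```bash
# Report the total size of the Osaurus models directory
du -sh "$HOME/Library/Containers/ai.dinoki.osaurus/Data/Library/Application Support/models/"
```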
To remove models:
- Open Model Manager
- Find the downloaded model
- Click Delete
- Confirm removal
Model Types
MLX Models
MLX models are optimized specifically for Apple Silicon:
- 4-bit Quantization — Best speed/quality trade-off
- 8-bit Quantization — Higher quality, more memory
- 16-bit — Maximum quality, significant memory usage
Apple Foundation Models
Available on supported macOS versions:
```bash
# Use with model ID "foundation"
curl -X POST http://127.0.0.1:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "foundation",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
Features:
- System-integrated model
- No download required
- Optimized for Apple Silicon
- Privacy-focused design
Model Naming Convention
Osaurus uses consistent model naming:
```
{model-family}-{version}-{size}-{variant}-{quantization}
```

Examples:

```
llama-3.2-3b-instruct-4bit
mistral-7b-instruct-v0.2-4bit
deepseek-coder-7b-instruct-4bit
```
Performance Characteristics
Memory Requirements
| Model Size | 4-bit | 8-bit | 16-bit |
|---|---|---|---|
| 2-3B | 2-3GB | 4-6GB | 8-12GB |
| 7-8B | 4-5GB | 8-10GB | 16-20GB |
| 13B | 8-10GB | 16-20GB | 32-40GB |
| 30B+ | 20-25GB | 40-50GB | 80-100GB |
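These figures follow a simple rule of thumb: weights take roughly half a byte per parameter at 4-bit (about 3.5GB for a 7B model), one byte at 8-bit, and two bytes at 16-bit. The extra headroom in each cell covers the KV cache and activations during inference.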
Speed Benchmarks
Typical tokens per second on an M2:
| Model | 4-bit | 8-bit |
|---|---|---|
| 3B | 40-60 | 30-45 |
| 7B | 20-35 | 15-25 |
| 13B | 12-20 | 8-15 |
Model Configuration
Context Length
Default context lengths by model family:
- Llama 3.2 — 4096 tokens
- Mistral — 8192 tokens
- Qwen 2.5 — 32768 tokens
- DeepSeek — 4096 tokens
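Note that the context window covers the prompt and the completion together: with Llama 3.2's 4096-token default, a 3500-token prompt leaves roughly 600 tokens for the response.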
Temperature Settings
Recommended temperature ranges:
- Creative Writing — 0.7-1.0
- Code Generation — 0.1-0.3
- General Chat — 0.5-0.7
- Factual Responses — 0.0-0.3
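Temperature is set per request via the standard OpenAI-compatible parameter. A sketch for low-temperature code generation (the model name assumes the DeepSeek Coder build listed above is downloaded):

```bash
# temperature 0.2 keeps code output near-deterministic
curl -X POST http://127.0.0.1:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-coder-7b-instruct-4bit",
    "temperature": 0.2,
    "messages": [{"role": "user", "content": "Write a function that reverses a string."}]
  }'
```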
System Prompts
Configure a default system prompt in Settings, or include one per request in the messages array:
```json
{
  "model": "llama-3.2-3b-instruct-4bit",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful, concise assistant."
    },
    {
      "role": "user",
      "content": "Explain quantum computing"
    }
  ]
}
```
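This JSON is the request body: send it in a POST to /v1/chat/completions, exactly as in the earlier curl examples.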
Model Selection Guide
By Use Case
General Purpose
- Llama 3.2 3B/8B
- Mistral 7B
- Qwen 2.5 3B/7B
Programming
- DeepSeek Coder 7B
- Code Llama 7B/13B
- Qwen 2.5 Coder
Creative Writing
- Llama 3.2 8B
- Mistral 7B
- Neural Chat 7B
Technical/Scientific
- Llama 3.2 8B
- Qwen 2.5 7B
- Mistral 7B
By System Resources
8GB RAM
- 2-3B models (4-bit)
- Single model at a time
16GB RAM
- 7-8B models (4-bit)
- 3B models (8-bit)
- Multiple small models
32GB+ RAM
- 13B models (4-bit)
- 7-8B models (8-bit)
- Larger context windows
Advanced Configuration
There are no global model aliasing or preloading options at this time. Control behavior per request via the OpenAI-compatible API.
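For example, a single request can combine several standard OpenAI-compatible parameters. A sketch (stream and max_tokens are the standard OpenAI field names; exact support may vary by model):

```bash
# Bounded, moderately creative, streamed completion
curl -X POST http://127.0.0.1:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-instruct-4bit",
    "temperature": 0.5,
    "max_tokens": 256,
    "stream": true,
    "messages": [{"role": "user", "content": "List three uses for a local LLM."}]
  }'
```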
Troubleshooting
Model Not Found
- Verify the model is downloaded in Model Manager
- Check the exact model name with `curl http://127.0.0.1:1337/v1/models`
- Ensure correct spelling and format
Slow Performance
- Check Activity Monitor for memory pressure
- Try smaller or more quantized models
- Close unnecessary applications
- Reduce context length in requests
Download Issues
- Check internet connection
- Verify available disk space
- Try pausing and resuming download
- Check Model Manager logs
Memory Errors
- Monitor RAM usage during inference
- Switch to more quantized versions
- Reduce max_tokens in requests
- Consider smaller models
Model Updates
Osaurus periodically updates available models:
- New models appear automatically in Model Manager
- Updated versions are marked with badges
- Old versions remain usable until deleted
- Check GitHub releases for model announcements
Questions about models? Join our Discord community or check the benchmarks page.