Quickstart
Get your first local LLM running in minutes. This guide walks through installation, setup, and your first API call.
Setup
Install Osaurus
Install using Homebrew:
brew install --cask osaurus
Alternatively, download directly from GitHub releases.
Launch the Application
- Open Osaurus from Spotlight (⌘ Space) or Applications
- Look for the Osaurus icon in your menu bar
- Click the icon to access the control panel
Start the Server
- Click Start Server in the menu
- Wait for the status to show Running on port 1337
- Your local LLM server is now active
Download a Model
- Select Model Manager from the menu bar
- Browse available models or use search
- For first-time users, we recommend Llama 3.2 3B Instruct 4bit:
  - Balanced performance and quality
  - 2GB download size
  - Runs efficiently on 8GB+ Macs
- Click Download and wait for completion
Test the API
Verify your installation with a simple request:
curl -s http://127.0.0.1:1337/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [{"role":"user","content":"Hello! Tell me a fun fact about dinosaurs."}],
"max_tokens": 100
}' | jq
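If the model is loaded, you should get back an OpenAI-style completion object. The exact fields and values depend on the model and server version, but a successful reply resembles this shape (illustrative, not captured output):
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "llama-3.2-3b-instruct-4bit",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Fun fact: some dinosaurs, like Velociraptor, likely had feathers."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 18, "completion_tokens": 21, "total_tokens": 39 }
}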
Using the Chat Interface
Osaurus includes an integrated chat interface for direct model interaction.
Accessing Chat
Press ⌘; (Command + Semicolon) to open the chat overlay. Type your message and press Enter to send. Press ⌘; again to close.
Chat Features
- Markdown Rendering — Formatted responses with syntax highlighting
- Copy Messages — Click the copy icon on any message
- Stop Generation — Interrupt streaming responses
- Model Selection — Switch models using the dropdown
- System Prompts — Configure in Settings → Chat
Example Requests
Creative Writing
curl -s http://127.0.0.1:1337/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [{"role":"user","content":"Write a haiku about coding late at night"}]
}' | jq -r '.choices[0].message.content'
Code Generation
curl -s http://127.0.0.1:1337/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [{"role":"user","content":"Write a Python function to reverse a string without using built-in functions"}]
}' | jq -r '.choices[0].message.content'
Streaming Responses
curl -N http://127.0.0.1:1337/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [{"role":"user","content":"Explain quantum computing in simple terms"}],
"stream": true
}'
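In code, streaming is easier to consume through an OpenAI-compatible client than by parsing server-sent events by hand. A minimal sketch using the OpenAI Python SDK, configured as in the Python Integration section below:
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:1337/v1", api_key="not-needed")

# stream=True yields chunks as tokens are generated
stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental delta; content can be None on some chunks
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()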
Apple Foundation Models
On macOS 26 Tahoe or later, access Apple's system models:
# Check availability
curl -s http://127.0.0.1:1337/v1/models | jq '.data[] | select(.id=="foundation")'
# Use foundation model
curl -s http://127.0.0.1:1337/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "foundation",
"messages": [{"role":"user","content":"What makes Apple Silicon special?"}]
}' | jq
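Programmatically, you can combine the availability check and the call. A minimal sketch, assuming the server is running and using the OpenAI SDK as configured later in this guide:
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:1337/v1", api_key="not-needed")

# The "foundation" model only appears on macOS 26 Tahoe or later
available = {m.id for m in client.models.list().data}

if "foundation" in available:
    response = client.chat.completions.create(
        model="foundation",
        messages=[{"role": "user", "content": "What makes Apple Silicon special?"}],
    )
    print(response.choices[0].message.content)
else:
    print("Apple Foundation Models are not available on this system.")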
Command Line Interface
Control Osaurus from Terminal:
# Start server
osaurus serve --port 1337
# Enable LAN access
osaurus serve --port 1337 --expose
# Check status
osaurus status
# Stop server
osaurus stop
# Open UI
osaurus ui
Python Integration
Use the OpenAI SDK with Osaurus:
from openai import OpenAI
# Configure for local server
client = OpenAI(
base_url="http://127.0.0.1:1337/v1",
api_key="not-needed"
)
# Make a request
response = client.chat.completions.create(
model="llama-3.2-3b-instruct-4bit",
messages=[
{"role": "user", "content": "Write a joke about programming"}
]
)
print(response.choices[0].message.content)
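The same client supports multi-turn conversations: append the assistant's reply to the message list before sending a follow-up. A minimal continuation of the example above:
# Keep the full message history to continue the conversation
messages = [
    {"role": "user", "content": "Write a joke about programming"},
    {"role": "assistant", "content": response.choices[0].message.content},
    {"role": "user", "content": "Now explain why it's funny"},
]

follow_up = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=messages,
)
print(follow_up.choices[0].message.content)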
JavaScript Integration
Access Osaurus from Node.js or browser environments:
const response = await fetch("http://127.0.0.1:1337/v1/chat/completions", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: "llama-3.2-3b-instruct-4bit",
messages: [{ role: "user", content: "What's the weather like on Mars?" }],
}),
});
const data = await response.json();
console.log(data.choices[0].message.content);
Model Recommendations
Fast Response Models (4-bit)
- Llama 3.2 3B Instruct — Excellent general-purpose model
- Qwen 2.5 3B Instruct — Strong reasoning capabilities
- Gemma 2 2B Instruct — Optimized for speed
Quality-Focused Models
- Llama 3.1 8B Instruct — Superior quality with reasonable speed
- Mistral 7B Instruct — Well-rounded performance
- DeepSeek Coder 7B — Specialized for programming tasks
High-Memory Systems (32GB+)
Consider 8-bit variants for enhanced quality or larger 13B-30B models for advanced use cases.
Performance Optimization
- Model Selection — Start with 4-bit models for optimal speed
- Resource Management — Close unnecessary applications
- Context Length — Shorter prompts yield faster responses
- Response Streaming — Improves perceived performance (see the sketch after this list)
- System Monitoring — Use Osaurus's built-in monitor
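To see why streaming improves perceived performance, compare time to first token against total generation time. A minimal sketch reusing the client setup from Python Integration; timings are illustrative and depend on your hardware:
import time
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:1337/v1", api_key="not-needed")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "List three uses for a paperclip"}],
    max_tokens=100,  # bounding max_tokens also bounds worst-case latency
    stream=True,
)

first_token = None
for chunk in stream:
    if chunk.choices[0].delta.content and first_token is None:
        first_token = time.perf_counter() - start  # perceived latency
total = time.perf_counter() - start

print(f"first token: {first_token:.2f}s, full response: {total:.2f}s")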
Troubleshooting
Model Not Found
- Verify download completion in Model Manager
- Check exact model name:
curl http://127.0.0.1:1337/v1/models
- Ensure lowercase naming with hyphens
Slow Performance
- Try smaller models (3B vs 7B)
- Reduce the max_tokens parameter
- Free up system memory
- Check Activity Monitor for resource usage
Connection Issues
- Verify server status:
osaurus status
- Check port configuration (default: 1337)
- Ensure the firewall allows localhost connections (a quick scripted check follows below)
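If the CLI looks healthy but clients still fail, a scripted check can isolate whether the HTTP endpoint itself is reachable. A minimal sketch using only the Python standard library:
import json
from urllib.request import urlopen
from urllib.error import URLError

try:
    # /v1/models responds even before a model is loaded
    with urlopen("http://127.0.0.1:1337/v1/models", timeout=5) as resp:
        models = json.load(resp)
        print("Server reachable; models:", [m["id"] for m in models["data"]])
except URLError as exc:
    print("Cannot reach Osaurus on port 1337:", exc)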
Next Steps
Explore these resources to go further:
- API Reference — Complete endpoint documentation
- Model Guide — Detailed model information
- Configuration — Customize Osaurus settings
- Integrations — Build with Osaurus
- Community — Connect with other users
You're now running AI locally on your Mac.
No cloud dependencies. No usage limits. Complete privacy.