Thaluna 3.0 Preview

OpenAI-Compatible Local API Setup

Thaluna 3.0 Preview can connect to a local OpenAI-compatible chat completion server. This is useful when you want to use llama.cpp, LM Studio, or another local model stack instead of the built-in models, Ollama, or OpenRouter.

Recommended llama.cpp settings for Thaluna

  • Use OpenAI-compatible server mode.
  • Base URL: http://127.0.0.1:8080/v1
  • Context: 4096-8192
  • Temperature: 0.1-0.3
  • Max tokens: 512-2048

Example:

llama-server -m model.gguf --port 8080 --ctx-size 8192

Thaluna sends short OCR translation requests, so very large context windows are usually not needed.

Where to Find the Local API Options

Local API option in the Thaluna translation model selector
Select Local API from the translation model selector when you want Thaluna to use your OpenAI-compatible local server.
Custom Local API settings in Thaluna Ollama and Cloud settings
Set the Custom Local API base URL, model ID, and optional API key in Settings > Ollama/Cloud.

Setup Steps

1. Start your local server

Start llama.cpp, LM Studio, or another OpenAI-compatible server before selecting Local API in Thaluna.

2. Set the base URL

In Thaluna, open Settings -> Ollama/Cloud and set the Custom Local API base URL. For llama.cpp on port 8080, use http://127.0.0.1:8080/v1.

3. Check the model ID

Keep the model ID matching your local server. The default local-model is fine for many local servers.

4. Leave API key empty unless required

Most localhost servers do not require an API key. Only fill the key field if your server expects one.

5. Select Local API

Open the translation model selector, choose Local API, pick your target language, and confirm.

6. Watch your server logs

If translation stalls, check the local server console first. A runaway generation can keep the request open until the model reaches its token or context limit.

If responses are slow or very long

  • Lower temperature.
  • Set a stricter max token limit in your local server/client configuration.
  • Avoid huge context windows like 131072 unless you know you need them, because they can increase VRAM/RAM usage and slow down local inference.
  • Use a model that follows short translation instructions reliably.