Ornith-1.0: Coding Models That Train Their Own Agent Scaffolds
By optimizing both the reasoning loop and the code output, these MIT-licensed models bring native agentic capabilities to local hardware.
Building an autonomous coding agent usually involves a familiar, slightly frustrating architecture. You take a state-of-the-art foundation model, wrap it in a complex Python framework to handle tool execution, write fragile state machines to parse its outputs, and construct elaborate prompt templates to keep it from losing its way. The model remains a static engine, while the surrounding framework, or scaffold, does the heavy lifting of managing the agentic loop.
The newly released Ornith-1.0 family of models flips this paradigm. Developed by deepreinforce-ai and released under an MIT license, Ornith-1.0 is a suite of self-improving open-source models designed specifically for agentic coding. Instead of treating the model and the agent scaffold as separate layers, Ornith uses reinforcement learning (RL) to jointly optimize both the code solutions and the scaffolding that drives those rollouts.
By training the model to discover its own search trajectories, Ornith-1.0 internalizes the agentic loop. The result is a family of models (ranging from a dense 9B parameter model to a massive 397B Mixture-of-Experts) that achieve state-of-the-art results on agentic benchmarks while running on standard open-source runtimes.
The Co-Training Paradigm: Optimizing the Scaffold
Traditional reinforcement learning for code generation typically focuses on the final output. The model is rewarded if the generated code passes a test suite or matches a reference solution. While this improves raw syntax generation, it does little to help the model navigate complex, multi-step tasks like debugging a repository or interacting with a terminal.
Ornith-1.0 addresses this by using RL to optimize the entire trajectory. During training, the model learns to generate both the solution rollouts and the scaffold (the intermediate reasoning steps, tool calls, and self-correction loops) that leads to those solutions. By jointly optimizing these two elements, the model learns how to search, when to execute a command, and how to recover when a tool returns an error.
This approach directly addresses the fragility of hand-coded agent frameworks. When a model has internalized the state transitions of a debugging loop, it is less likely to get stuck in repetitive generation cycles or fail due to minor parsing discrepancies.
Benchmark Performance
The practical impact of this training methodology is evident in the benchmarks. Evaluated against size-appropriate baselines using standard harnesses like OpenHands and mini-SWE-agent, the Ornith models consistently outperform their base architectures (Qwen 3.5 and Gemma 4) on complex agentic tasks.
On SWE-bench Verified, which measures a model's ability to resolve real-world GitHub issues, the Ornith-1.0-35B MoE model scores 75.6%, outperforming both Qwen3.5-35B (70%) and Gemma4-31B (52%).
xychart-beta
title "SWE-bench Verified Scores (35B Tier)"
x-axis [Gemma4-31B, Qwen3.5-35B, Qwen3.6-35B, Ornith-1.0-35B]
y-axis "Score (%)" 0 --> 100
bar [52.0, 70.0, 73.4, 75.6]
At the high end, the Ornith-1.0-397B MoE model achieves 82.4% on SWE-bench Verified and 62.2% on SWE-bench Pro. It also performs exceptionally well on Terminal-Bench 2.1, scoring 77.5% under the Harbor/Terminus-2 framework and 78.2% using Claude Code, placing it in direct competition with proprietary frontier models.
Developer Angle: Deploying and Interfacing with Ornith
For developers looking to integrate Ornith-1.0 into their workflows, the models are highly accessible due to their standard open-source foundations. They support a 256K context window and are published in several precision formats, including FP8 and GGUF.
Hardware Sizing
- Ornith-1.0-9B (Dense): Fits comfortably on a single 80GB GPU in BF16. A GGUF version is available for local inference via Ollama or llama.cpp.
- Ornith-1.0-35B (MoE): Available in BF16 and FP8. The FP8 variant cuts VRAM requirements roughly in half, making it viable for multi-GPU setups on a single node.
- Ornith-1.0-397B (MoE): Designed for multi-GPU nodes using tensor parallelism.
Serving with vLLM
Because Ornith-1.0 is a reasoning model, it outputs a <think> ... </think> block before delivering its final answer. Modern inference engines can parse this structure natively. To serve the model, you will need recent runtime versions: Transformers >= 5.8.1, vLLM >= 0.19.1, or SGLang >= 0.5.9.
When deploying via vLLM, the engine's reasoning parser separates the chain-of-thought into a dedicated reasoning_content field in the API response. Similarly, the tool-call parser translates the model's internal <tool_call> blocks into standard OpenAI-compatible tool calls.
You can spin up an OpenAI-compatible server using vLLM with the following configuration:
python -m vllm.entrypoints.openai.api_server \
--model deepreinforce-ai/Ornith-1.0-35B-FP8 \
--served-model-name Ornith-1.0 \
--tensor-parallel-size 2 \
--max-model-len 262144 \
--enable-reasoning \
--enable-tool-call-parser
For optimal performance, the recommended sampling parameters are temperature=0.6, top_p=0.95, and top_k=20. If you are trying to replicate the exact benchmark figures, you should increase the temperature to 1.0.
The Trade-offs of Internalized Scaffolding
While co-training the scaffold and the model yields impressive benchmark gains, it introduces new trade-offs for system architects.
First, debugging an agent's behavior becomes more complex. When using a traditional Python-based framework, you can inspect the state machine, set breakpoints, and explicitly modify the transition logic. With Ornith, because the search trajectory and tool-use strategies are baked into the model's weights via RL, altering the agent's core behavior requires fine-tuning or highly specific system prompting rather than a quick code change in your orchestration layer.
Second, the computational overhead of reasoning models is high. The 256K context window is incredibly useful for digesting entire codebases, but processing long contexts alongside deep chain-of-thought reasoning blocks demands significant GPU memory and increases time-to-first-token latency. Developers will need to carefully balance the depth of the model's reasoning steps against the latency requirements of interactive development environments.
Despite these caveats, Ornith-1.0 represents a clear shift in how we build coding agents. By moving the complexity of the agentic loop out of fragile runtime code and into the model's learned weights, it paves the way for more resilient, self-contained autonomous developers.
Sources & further reading
Priya covers AI frameworks, developer productivity tooling, and the startup ecosystem across South and Southeast Asia, bringing a researcher's rigour and a practitioner's empathy to every story. She is deeply sceptical of benchmarks and asks hard questions so her readers don't have to.
Discussion 0
No comments yet
Be the first to weigh in.