GLM 5.2 Is a Point Behind Opus — Until the Task Runs for Hours

Open weights and a 5.7x price cut make the near-parity headline real, but the gap reopens exactly where long-horizon agents live.

Mariana Souza

Senior Editor · Jun 22, 2026 · 8 min read

"Within a point of Opus." That's the line ricocheting around the model-selection chatter since Z.ai put GLM 5.2 — a 753B-parameter MoE under an MIT license — up against Anthropic's Claude Opus 4.8. And on the right benchmark, it's true: GLM 5.2 trails Opus 4.8 by 0.7 points on FrontierSWE (74.4 vs 75.1) while charging 5.7x less for output tokens. For a closed-frontier crown to be threatened by downloadable weights is a genuine shift, not hype.

But "within a point" is an average that hides the shape of the curve. Read the full scorecard sorted by margin and a clean pattern emerges: GLM 5.2 is at or above Opus on competition math and one terminal harness, dead even on short agentic evals — and then the gap widens steadily as tasks get longer and more open-ended. The headline sells near-parity precisely for the long-horizon coding-agent work GLM was built to target, and that's the one claim the numbers don't fully support yet. The honest read: GLM 5.2 is the best open-weights model anyone has shipped, an outstanding cheap workhorse, and not a drop-in Opus replacement for your hardest jobs.

Where parity holds — and where it breaks

Z.ai published GLM 5.2 against Opus 4.8 across 19 reasoning, coding, and agentic benchmarks. The deltas (GLM minus Opus, self-reported under matched harnesses) tell the real story:

GLM wins: IMOAnswerBench (+7.5), Terminal-Bench 2.1 best harness (+3.8), AIME 2026 (+3.5). Olympiad-grade math is genuinely at or above the frontier — AIME 2026 lands 99.2 vs 95.7.
Effectively tied: FrontierSWE (−0.7), MCP-Atlas (−1.0).
Opus pulls away as tasks lengthen: SWE-bench Pro (−7.1), Tool-Decathlon (−11.7), DeepSWE (−11.8), SWE-Marathon (−13.0), NL2Repo (−20.8).

That ordering is the whole argument. Three of the four evals GLM loses by double digits (SWE-Marathon, NL2Repo, DeepSWE) are the multi-hour, repository-scale, plan-and-hold-it-together tasks; the fourth, Tool-Decathlon, is a tool-use eval. Independent figures line up: CodingFleet reports Opus at 69.2% on SWE-bench Pro to GLM's 62.1%, the same ~7-point gap. On llm-stats' read, Opus 4.8 "wins most benchmarks, with its largest margins on multi-hour software engineering and tool-use tasks."
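The ordering is easy to reproduce. Here is a minimal sketch that uses only the ten deltas quoted in this section (the remaining benchmarks in Z.ai's set of 19 are not listed here, so they are left out). The bucket labels and the 2-point tie band are illustrative assumptions, not categories Z.ai publishes.

```python
# Deltas are GLM 5.2 minus Opus 4.8, in points, as quoted in this article.
DELTAS = {
    "IMOAnswerBench": 7.5,
    "Terminal-Bench 2.1 (best harness)": 3.8,
    "AIME 2026": 3.5,
    "FrontierSWE": -0.7,
    "MCP-Atlas": -1.0,
    "SWE-bench Pro": -7.1,
    "Tool-Decathlon": -11.7,
    "DeepSWE": -11.8,
    "SWE-Marathon": -13.0,
    "NL2Repo": -20.8,
}


def bucket(delta, tie_band=2.0):
    """Label a delta. tie_band is an assumed threshold, not a published one."""
    if delta > tie_band:
        return "GLM ahead"
    if delta >= -tie_band:
        return "tied"
    return "Opus ahead"


def scorecard(deltas=DELTAS):
    """Return (benchmark, delta, label) rows sorted from GLM's best to worst."""
    rows = sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, d, bucket(d)) for name, d in rows]


if __name__ == "__main__":
    for name, d, label in scorecard():
        print(f"{d:+6.1f}  {label:<10}  {name}")
```

Sorting this way puts the math evals at the top and the repository-scale evals at the bottom, which is the pattern the averaged headline number hides.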

One data caveat worth stating plainly: these scores are self-reported by Z.ai under harnesses it chose. Treat them as a vendor's best case, not a referee's verdict. One external anchor, the Artificial Analysis Intelligence Index v4.1, puts GLM 5.2 at 51, top open-weight model and 5th overall, which is consistent with "frontier-adjacent, not frontier-leading."

The one-shot build that the benchmarks can't fake

The most useful evidence isn't a leaderboard. One reproducible head-to-head (source on GitHub) handed both models the same one-shot prompt: build a 3D platformer in raw WebGL (no Three.js, no engine), including a GLB parser, matrix/quaternion math, GLSL skinning shaders, substepped AABB collision, and a follow camera. This is a good probe because it tests both things people argue about at once: holding a layered, multi-file build together (the agentic part) and getting right the engine internals that look fine but quietly break (the reasoning-and-taste part).

Both shipped a playable game. The deltas are the point:

Metric	GLM 5.2 (Pi/OpenRouter)	Opus 4.8 (Claude Code)
Wall-clock	1h 10m 40s	33m 30s
Output tokens	131,000	216,809
Tool calls	128	153
Cost	$5.39 (billed)	~$21.92 (est. list)

GLM cost roughly a quarter as much. Opus finished in under half the time and, per the testers, shipped a cleaner, more correct game. The most telling detail isn't in the table: Opus could check its own visual output and GLM couldn't, because GLM 5.2 is text-only. When you're debugging a renderer, being able to look at the frame you just produced is not a nice-to-have. That single architectural fact decides a whole category of work.
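The ratios behind those two sentences can be checked directly from the table. This is a small sketch using only the table's figures; note that the Opus cost is the testers' estimate at list price, not a billed amount.

```python
# Figures copied from the table above. The Opus cost is an estimate at list price.
RUNS = {
    "GLM 5.2": {
        "seconds": 1 * 3600 + 10 * 60 + 40,  # 1h 10m 40s
        "output_tokens": 131_000,
        "tool_calls": 128,
        "cost_usd": 5.39,
    },
    "Opus 4.8": {
        "seconds": 33 * 60 + 30,             # 33m 30s
        "output_tokens": 216_809,
        "tool_calls": 153,
        "cost_usd": 21.92,
    },
}


def ratio(metric, a="GLM 5.2", b="Opus 4.8", runs=RUNS):
    """Return runs[a][metric] / runs[b][metric]."""
    return runs[a][metric] / runs[b][metric]


if __name__ == "__main__":
    for metric in ("seconds", "output_tokens", "tool_calls", "cost_usd"):
        print(f"{metric:<14} GLM/Opus = {ratio(metric):.2f}")
```

On these numbers GLM took a little over twice the wall-clock time for about a quarter of the cost, while emitting fewer output tokens and making fewer tool calls.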

The developer angle: this is a routing decision, not a swap

Stop framing this as "which model do I switch to." The economics and the capability curve point at a router, and the dividing lines are concrete.

Send to GLM 5.2: high-volume, well-scoped agent turns, such as codegen for Python and common web frameworks, structured JSON/function-calling loops, refactors that fit in a single large diff (the 131K-token output cap is big enough to regenerate very large files in one pass), bulk eval runs, and anything CI-driven. Output tokens dominate an agent's bill, and that's where the 5.7x gap bites. The often-cited illustration: a workload that's ~$1,000/day on Opus output lands near $176/day on GLM. Crucially, the $1.40/$4.40 rate holds across Z.ai, Novita, and Friendli, so it's a property of the model, not one vendor's promo.

Keep on Opus 4.8: multi-hour autonomous runs over a large repo, novel-strategy planning, and anything touching images — screenshots, PDFs, UI state, or a model inspecting its own rendered output. Those are Opus-only jobs today.
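The two lists above translate fairly directly into a routing function. What follows is a hedged sketch, not anyone's production router: the `Task` fields, the model labels, and the 2-hour cutoff for "multi-hour" are all hypothetical names and values chosen for illustration. The cost helper simply applies the 5.7x output-price ratio quoted in this article.

```python
from dataclasses import dataclass

GLM = "glm-5.2"    # placeholder label, not a confirmed API model ID
OPUS = "opus-4.8"  # placeholder label, not a confirmed API model ID


@dataclass
class Task:
    needs_vision: bool = False    # screenshots, PDFs, UI state, rendered frames
    expected_hours: float = 0.1   # rough estimate of autonomous run time
    repo_scale: bool = False      # spans a large repository
    novel_planning: bool = False  # needs a new strategy, not a known recipe


def route(task: Task, long_run_hours: float = 2.0) -> str:
    """Pick a model using the article's dividing lines.

    long_run_hours is an assumed cutoff for "multi-hour"; tune it on your own evals.
    """
    if task.needs_vision:
        return OPUS  # GLM 5.2 is text-only
    if task.novel_planning:
        return OPUS
    if task.repo_scale and task.expected_hours >= long_run_hours:
        return OPUS
    return GLM       # high-volume, well-scoped turns


def glm_output_cost(opus_output_cost_usd: float, price_ratio: float = 5.7) -> float:
    """Scale an Opus output-token bill by the quoted 5.7x price ratio."""
    return opus_output_cost_usd / price_ratio


if __name__ == "__main__":
    print(route(Task(needs_vision=True)))
    print(route(Task(repo_scale=True, expected_hours=3)))
    print(route(Task()))
    print(f"${glm_output_cost(1000):.2f} per day")
```

With the rounded 5.7x ratio, `glm_output_cost(1000)` comes out a shade under the ~$176 figure; the small difference is consistent with 5.7x itself being a rounded number. Checking vision first is a deliberate ordering choice, since that is the one constraint no amount of prompt engineering works around.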

The open-weights dividend is the part no benchmark captures. MIT weights on Hugging Face and ModelScope, servable on vLLM, SGLang, KTransformers, or Transformers, mean you can fine-tune, quantize, run air-gapped, and pin a version forever. For regulated data that can't leave your network, that outweighs the entire benchmark table. And it's an insurance policy against the closed-model failure mode — a provider deprecating or restricting a model you built a product on. Weights you've downloaded can't be retired out from under you.

Two honest caveats before you commit budget. First, GLM 5.2's headline feature is a usable 1M-token context held together by an architecture change (Z.ai calls it IndexShare); that's a claim that's hard to verify from a spec sheet, so stress it on your own long trajectories before trusting it. Second, watch the noise: the numbers don't agree from source to source. One comparison lists GLM at a 128K context and ~$2/$6 pricing, which contradicts the 1M context and $1.40/$4.40 that the vendor docs and most independent write-ups report. When the figures diverge, trust the vendor pricing page and your own eval harness, not the aggregators.
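One cheap way to start stressing a long-context claim is a planted-fact probe: bury a known string at different depths in your own filler text and see whether the model returns it. The sketch below assumes a hypothetical `ask` callable that wraps whatever client you use and returns the model's reply as a string; nothing here is a Z.ai or Anthropic API. Retrieval of a single planted fact is a much weaker test than a real multi-hour trajectory, so treat a pass as necessary rather than sufficient.

```python
import random
from typing import Callable, Dict, List, Sequence


def build_probe(filler: List[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler lines."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    pos = round(depth * len(filler))
    return "\n".join(filler[:pos] + [needle] + filler[pos:])


def recall_by_depth(
    ask: Callable[[str], str],
    filler: List[str],
    depths: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
    seed: int = 0,
) -> Dict[float, bool]:
    """Return {depth: True/False} for whether the reply contains the planted code."""
    rng = random.Random(seed)
    results = {}
    for depth in depths:
        code = f"{rng.randrange(10**6):06d}"  # fresh 6-digit code per depth
        needle = f"The deploy passphrase is {code}."
        prompt = build_probe(filler, needle, depth) + "\n\nWhat is the deploy passphrase?"
        results[depth] = code in ask(prompt)
    return results
```

In practice `filler` would be lines from your own logs or source files, sized to approach the context length you intend to rely on, and `ask` would call the model under test.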

The verdict

GLM 5.2 doesn't dethrone Opus 4.8, and the "within a point" framing flatters it on exactly the long-horizon agentic work it was marketed for. What it does is more interesting: it makes the closed frontier look expensive without making it look slow, and it makes openness a first-class selection criterion again. The rational 2026 stack isn't one model — it's a cheap, self-hostable workhorse handling the bulk of well-scoped turns, with a premium closed model reserved for the hardest multi-hour reasoning and every vision task. Run both against your own representative tasks; a one-point benchmark delta never settled a real workload, and your codebase isn't SWE-bench. The frontier is still closed. The floor underneath it just dropped by 5.7x.

Sources & further reading

GLM 5.2 vs. Opus — techstackups.com
GLM-5.2 vs Claude Opus 4.8: Full Comparison — llm-stats.com
GLM 5.2 vs GPT 5.5 vs Claude Opus 4.8: Which Model Wins for Agentic Workflows? — mindstudio.ai
GLM 5.2 vs Claude Opus 4.8 vs GPT-5.5 — lushbinary.com

Discussion 2

Iris Lund @designer_iris · 1 week ago

i love that they're highlighting the price cut, but i'm more curious about the ux implications of these models - how do the differing performance curves impact the user experience, especially for those long-horizon tasks?

Dmitri Sokolov @ai_doomer_dmitri · 1 week ago

totally with you on that @designer_iris, the ux implications are where things get really interesting - and potentially concerning, since those long-horizon tasks are exactly where we start to see the gap reopen between glm 5.2 and opus, which could lead to some unexpected - and potentially unsafe - behavior 🤔
