🚀 Ollama Models Benchmark

Comprehensive comparison of GLM-5, Qwen 3.5, and Gemma 4 cloud models for coding agents and app modernisation

Test Date: April 8-9, 2026

📝 Coding Agent Benchmark

Six comprehensive tests evaluating code generation, bug fixing, code review, refactoring, test writing, and architecture design.
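
The source does not describe how the timings were collected. As a minimal sketch, assuming each figure is the wall-clock time of one blocking request per test, a harness could look like the following. `time_tests`, `fake_model`, and the prompt strings are hypothetical names introduced for illustration; a real run would replace `fake_model` with a call to the model under test.

```python
import time
from typing import Callable, Dict


def time_tests(run_prompt: Callable[[str], str],
               prompts: Dict[str, str]) -> Dict[str, float]:
    """Return wall-clock seconds per test, rounded to 0.1s."""
    timings = {}
    for name, prompt in prompts.items():
        start = time.perf_counter()
        run_prompt(prompt)  # blocks until the full reply is returned
        timings[name] = round(time.perf_counter() - start, 1)
    return timings


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs without a server."""
    return prompt.upper()


# Test names come from the benchmark; the prompt texts are placeholders.
PROMPTS = {
    "Code Generation": "placeholder prompt for code generation",
    "Bug Fixing": "placeholder prompt for bug fixing",
    "Code Review": "placeholder prompt for code review",
    "Refactoring": "placeholder prompt for refactoring",
    "Test Writing": "placeholder prompt for test writing",
    "Architecture": "placeholder prompt for architecture design",
}

timings = time_tests(fake_model, PROMPTS)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.1f}s")
print(f"Full suite: {sum(timings.values()):.1f}s")
```

Passing the model call in as a function keeps the timing loop identical for all three models, so differences in the results come from the models rather than the harness.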

🏆 Overall Winner

Qwen 3.5
256.4s for full suite

⚡ Fastest Code Gen

Gemma 4
19.7s for code generation

🐢 Slowest Overall

GLM-5
808.9s (about 3.2x the Qwen 3.5 total)

💡 Best for Agents

Qwen 3.5
Speed + quality = best balance

| Test | Qwen 3.5 | Gemma 4 | GLM-5 |
|---|---|---|---|
| 1. Code Generation | 75.1s | 19.7s 🏆 | 72.9s |
| 2. Bug Fixing | 13.9s 🏆 | 84.0s | 171.7s |
| 3. Code Review | 19.0s 🏆 | 48.8s | 104.2s |
| 4. Refactoring | 28.2s 🏆 | 25.1s | 92.0s |
| 5. Test Writing | 49.8s 🏆 | 101.0s | 167.2s |
| 6. Architecture | 70.4s 🏆 | 78.3s | 201.2s |
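
The full-suite totals and the GLM-5 slowdown can be re-derived from the per-test times. The sketch below copies the figures from the table above and sums them per model. Summing the rounded GLM-5 rows gives 809.2s rather than the reported 808.9s, presumably because the reported total was computed from unrounded timings (an assumption, as the source does not say).

```python
# Per-test wall-clock times in seconds, copied from the table above.
# Column order: Qwen 3.5, Gemma 4, GLM-5.
MODELS = ("Qwen 3.5", "Gemma 4", "GLM-5")
TIMES = {
    "Code Generation": (75.1, 19.7, 72.9),
    "Bug Fixing": (13.9, 84.0, 171.7),
    "Code Review": (19.0, 48.8, 104.2),
    "Refactoring": (28.2, 25.1, 92.0),
    "Test Writing": (49.8, 101.0, 167.2),
    "Architecture": (70.4, 78.3, 201.2),
}


def suite_totals(times):
    """Sum each model's column and round to 0.1s."""
    totals = [0.0] * len(MODELS)
    for row in times.values():
        for i, seconds in enumerate(row):
            totals[i] += seconds
    return {model: round(total, 1) for model, total in zip(MODELS, totals)}


totals = suite_totals(TIMES)
ratio = totals["GLM-5"] / totals["Qwen 3.5"]

for model, total in totals.items():
    print(f"{model}: {total:.1f}s")
print(f"GLM-5 takes {ratio:.2f}x as long as Qwen 3.5")
```

Gemma 4 lands between the two at 356.9s for the suite, which the summary cards do not show.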

✅ Verdict for OpenClaw

Keep Qwen 3.5 as primary: it is the best all-rounder for agentic workflows, with the fastest full-suite time (256.4s) and wins in 5 of 6 tests. GLM-5 is powerful but takes roughly 3x as long, which makes it unsuitable for real-time agents. Use GLM-5 for offline batch analysis only.
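
One way to express this verdict in an agent's configuration is a small routing rule. This is a sketch only: the model tags `qwen3.5` and `glm-5`, the `Task` type, and `pick_model` are hypothetical, and neither the real Ollama identifiers nor OpenClaw's configuration format is given in the source.

```python
from dataclasses import dataclass

# Hypothetical model tags; substitute the identifiers your setup actually uses.
PRIMARY_MODEL = "qwen3.5"
BATCH_MODEL = "glm-5"


@dataclass(frozen=True)
class Task:
    name: str
    interactive: bool  # True when a user or agent loop is waiting on the reply


def pick_model(task: Task) -> str:
    """Send latency-sensitive work to the primary model, offline work to GLM-5."""
    return PRIMARY_MODEL if task.interactive else BATCH_MODEL


print(pick_model(Task("fix failing test", interactive=True)))
print(pick_model(Task("nightly legacy code audit", interactive=False)))
```

Keeping the rule to a single boolean means GLM-5 can never be selected for a request that something is waiting on.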

🏢 App Modernisation Benchmark

Six complex tests: PL/SQL rules extraction, Java JEE documentation, OpenAPI spec generation, .NET forward engineering, integration design, and test specifications.

🏆 Overall Winner

Qwen 3.5
324.0s for full suite

📈 Output Depth

Qwen 3.5
8,574 total words (vs 7,490 for GLM-5)

⚡ Fastest Task

Qwen 3.5
18.1s on Java JEE docs

🐢 Slowest Task

GLM-5
318.2s on integration design

| Test | Qwen 3.5 | Gemma 4 | GLM-5 |
|---|---|---|---|
| 1. PL/SQL Rules Extraction | 19.8s 🏆 | 47.2s | 161.4s |
| 2. Java JEE Documentation | 18.1s 🏆 | 48.3s | 109.1s |
| 3. OpenAPI Spec Generation | 56.2s | 101.5s | 165.6s 🏆 (most detailed) |
| 4. .NET Forward Engineering | 114.4s | 110.0s 🏆 | 144.7s |
| 5. Integration Design | 58.6s 🏆 | 102.3s | 318.2s |
| 6. Test Specification | 56.9s 🏆 | 126.0s | 106.9s |
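
The trophies in the table above do not all track raw speed: the OpenAPI award goes to GLM-5 for detail. The sketch below, using only the figures from the table, finds the fastest model per test and flags where the awarded model differs. The `AWARDED` mapping is transcribed from the trophy markers.

```python
# Per-test times in seconds, copied from the table above.
# Column order: Qwen 3.5, Gemma 4, GLM-5.
MODELS = ("Qwen 3.5", "Gemma 4", "GLM-5")
TIMES = {
    "PL/SQL Rules Extraction": (19.8, 47.2, 161.4),
    "Java JEE Documentation": (18.1, 48.3, 109.1),
    "OpenAPI Spec Generation": (56.2, 101.5, 165.6),
    ".NET Forward Engineering": (114.4, 110.0, 144.7),
    "Integration Design": (58.6, 102.3, 318.2),
    "Test Specification": (56.9, 126.0, 106.9),
}
# Trophy holders as marked in the table.
AWARDED = {
    "PL/SQL Rules Extraction": "Qwen 3.5",
    "Java JEE Documentation": "Qwen 3.5",
    "OpenAPI Spec Generation": "GLM-5",
    ".NET Forward Engineering": "Gemma 4",
    "Integration Design": "Qwen 3.5",
    "Test Specification": "Qwen 3.5",
}


def fastest_model(row):
    """Return the model name with the smallest time in a table row."""
    return MODELS[row.index(min(row))]


fastest = {test: fastest_model(row) for test, row in TIMES.items()}
mismatches = [test for test in TIMES if fastest[test] != AWARDED[test]]

for test in TIMES:
    note = "" if test not in mismatches else "  (award not based on speed)"
    print(f"{test}: fastest {fastest[test]}, awarded {AWARDED[test]}{note}")
```

On these figures Qwen 3.5 is the fastest model in five of the six tests, while holding four of the six awards.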

✅ Verdict for Modernisation

Keep Qwen 3.5 as primary: it wins 4 of 6 tests, with faster execution and comparable or better output depth. GLM-5 leads only on OpenAPI spec completeness, and its one other edge is catching subtle logic gaps in legacy code. For production workflows, Qwen 3.5 is the clear choice.
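
To put "faster execution and comparable or better output depth" on one scale, the word counts can be divided by the suite times. The sketch below uses the reported word totals and the per-test times from the modernisation table. The GLM-5 suite total is not stated in the source and is derived here by summing its rows. Gemma 4 is left out because no word count is reported for it.

```python
# Reported total output words for the modernisation suite.
QWEN_WORDS = 8574
GLM_WORDS = 7490

# Per-test times in seconds, copied from the modernisation table.
QWEN_TIMES = (19.8, 18.1, 56.2, 114.4, 58.6, 56.9)
GLM_TIMES = (161.4, 109.1, 165.6, 144.7, 318.2, 106.9)


def words_per_second(words, times):
    """Average output rate over the whole suite."""
    return words / sum(times)


qwen_wps = words_per_second(QWEN_WORDS, QWEN_TIMES)
glm_wps = words_per_second(GLM_WORDS, GLM_TIMES)

print(f"Qwen 3.5: {qwen_wps:.1f} words/s over {sum(QWEN_TIMES):.1f}s")
print(f"GLM-5:    {glm_wps:.1f} words/s over {sum(GLM_TIMES):.1f}s")
print(f"Qwen 3.5 produces output {qwen_wps / glm_wps:.1f}x as fast")
```

Word count is only a rough proxy for depth, so this rate says nothing about the quality of what was written.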