Ollama Models Comparison – GLM-5 vs Qwen 3.5 vs Gemma 4

📝 Coding Agent Benchmark

Six comprehensive tests evaluating code generation, bug fixing, code review, refactoring, test writing, and architecture design.
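The page does not say how the timings were collected. As a rough sketch of one way to reproduce them, the snippet below times a single prompt against a local Ollama server through its `/api/generate` endpoint. It assumes the default local address, and the helper names (`build_request`, `summarize`, `time_model`) are illustrative, not taken from the benchmark itself.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for one model and prompt."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(reply: dict) -> dict:
    """Convert Ollama's nanosecond counters into seconds and tokens per second."""
    total_s = reply["total_duration"] / 1e9
    eval_s = reply["eval_duration"] / 1e9
    tokens = reply["eval_count"]
    return {
        "total_s": round(total_s, 1),
        "tokens": tokens,
        "tokens_per_s": round(tokens / eval_s, 1) if eval_s else 0.0,
        "words": len(reply["response"].split()),
    }


def time_model(model: str, prompt: str, timeout: float = 900.0) -> dict:
    """Send one prompt and return the timing summary (needs a running server)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return summarize(json.load(resp))
```

A call such as `time_model("qwen3.5", "Write a binary search in Python")` would return the summary for one test; the model tag here is hypothetical and should match whatever `ollama list` reports. Note that `total_duration` includes model load time, so a first call against a cold model can look slower than later ones.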

🏆 Overall Winner: Qwen 3.5 (256.4s for the full suite)

⚡ Fastest Code Gen: Gemma 4 (19.7s for code generation)

🐢 Slowest Overall: GLM-5 (808.9s, about 3.2x Qwen 3.5's total)

💡 Best for Agents: Qwen 3.5 (best balance of speed and quality)

Test	Qwen 3.5	Gemma 4	GLM-5
1. Code Generation	75.1s	19.7s 🏆	72.9s
2. Bug Fixing	13.9s 🏆	84.0s	171.7s
3. Code Review	19.0s 🏆	48.8s	104.2s
4. Refactoring	28.2s 🏆	25.1s	92.0s
5. Test Writing	49.8s 🏆	101.0s	167.2s
6. Architecture	70.4s 🏆	78.3s	201.2s
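The headline totals can be re-derived from the per-test rows. The sketch below copies the timings from the table into a dictionary (the variable and function names are illustrative), sums them per model, and reports which model is fastest on each test by time alone.

```python
# Per-test times in seconds, copied from the table above.
TIMES = {
    "Code Generation": {"Qwen 3.5": 75.1, "Gemma 4": 19.7, "GLM-5": 72.9},
    "Bug Fixing": {"Qwen 3.5": 13.9, "Gemma 4": 84.0, "GLM-5": 171.7},
    "Code Review": {"Qwen 3.5": 19.0, "Gemma 4": 48.8, "GLM-5": 104.2},
    "Refactoring": {"Qwen 3.5": 28.2, "Gemma 4": 25.1, "GLM-5": 92.0},
    "Test Writing": {"Qwen 3.5": 49.8, "Gemma 4": 101.0, "GLM-5": 167.2},
    "Architecture": {"Qwen 3.5": 70.4, "Gemma 4": 78.3, "GLM-5": 201.2},
}


def suite_totals(times: dict) -> dict:
    """Sum the per-test seconds for each model, rounded to 0.1s."""
    totals: dict = {}
    for row in times.values():
        for model, secs in row.items():
            totals[model] = totals.get(model, 0.0) + secs
    return {model: round(total, 1) for model, total in totals.items()}


def fastest_per_test(times: dict) -> dict:
    """Name the model with the lowest time on each test."""
    return {test: min(row, key=row.get) for test, row in times.items()}


totals = suite_totals(TIMES)
ratio = round(totals["GLM-5"] / totals["Qwen 3.5"], 2)

print(totals)
print(f"GLM-5 / Qwen 3.5 = {ratio}x")
print(fastest_per_test(TIMES))
```

The rows reproduce Qwen 3.5's 256.4s exactly and give 356.9s for Gemma 4. For GLM-5 they sum to 809.2s, 0.3s above the reported 808.9s, which is presumably rounding in the per-test figures. By time alone, Gemma 4 is fastest on both code generation and refactoring, so the 🏆 on the refactoring row presumably reflects output quality as well as speed.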

✅ Verdict for OpenClaw

Keep Qwen 3.5 as the primary model: it is the best all-rounder for agentic workflows, finishing the suite in 256.4s and winning five of the six tests. GLM-5 is capable but roughly 3x slower (808.9s), which makes it unsuitable for real-time agents. Reserve GLM-5 for offline batch analysis.
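One way to apply this verdict is a small routing rule that sends latency-sensitive work to the primary model and offline work to the slower one. This is a minimal sketch only: the `Job` type, the `pick_model` helper, and both model tags are hypothetical, not part of OpenClaw or Ollama.

```python
from dataclasses import dataclass

# Hypothetical Ollama model tags; substitute whatever `ollama list` shows locally.
PRIMARY_MODEL = "qwen3.5"
BATCH_MODEL = "glm-5"


@dataclass(frozen=True)
class Job:
    name: str
    realtime: bool  # True when an agent loop or a user is waiting on the reply


def pick_model(job: Job) -> str:
    """Route latency-sensitive jobs to the primary model, offline jobs to the batch model."""
    return PRIMARY_MODEL if job.realtime else BATCH_MODEL


jobs = [
    Job("fix failing test", realtime=True),
    Job("nightly legacy-code audit", realtime=False),
]
for job in jobs:
    print(f"{job.name} -> {pick_model(job)}")
```

Keeping the rule to a single boolean makes the trade-off explicit: anything a caller is waiting on goes to the faster model, and only work that can tolerate multi-minute responses goes to GLM-5.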

🏢 App Modernisation Benchmark

Six complex tests: PL/SQL rules extraction, Java JEE documentation, OpenAPI spec generation, .NET forward engineering, integration design, and test specifications.

🏆 Overall Winner: Qwen 3.5 (324.0s for the full suite)

📈 Output Depth: Qwen 3.5 (8,574 total words vs 7,490 for GLM-5)

⚡ Fastest Task: Qwen 3.5 (18.1s on Java JEE documentation)

🐢 Slowest Task: GLM-5 (318.2s on integration design)

Test	Qwen 3.5	Gemma 4	GLM-5
1. PL/SQL Rules Extraction	19.8s 🏆	47.2s	161.4s
2. Java JEE Documentation	18.1s 🏆	48.3s	109.1s
3. OpenAPI Spec Generation	56.2s	101.5s	165.6s 🏆 (most detailed)
4. .NET Forward Engineering	114.4s	110.0s 🏆	144.7s
5. Integration Design	58.6s 🏆	102.3s	318.2s
6. Test Specification	56.9s 🏆	126.0s	106.9s
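Output depth and speed can be combined into a rough words-per-second figure. The sketch below uses the timings from the table and the two word counts reported on this page; no word count is given for Gemma 4, so it is left out of the throughput step. Function and variable names are illustrative.

```python
# Per-test times in seconds, copied from the table above.
TIMES = {
    "PL/SQL Rules Extraction": {"Qwen 3.5": 19.8, "Gemma 4": 47.2, "GLM-5": 161.4},
    "Java JEE Documentation": {"Qwen 3.5": 18.1, "Gemma 4": 48.3, "GLM-5": 109.1},
    "OpenAPI Spec Generation": {"Qwen 3.5": 56.2, "Gemma 4": 101.5, "GLM-5": 165.6},
    ".NET Forward Engineering": {"Qwen 3.5": 114.4, "Gemma 4": 110.0, "GLM-5": 144.7},
    "Integration Design": {"Qwen 3.5": 58.6, "Gemma 4": 102.3, "GLM-5": 318.2},
    "Test Specification": {"Qwen 3.5": 56.9, "Gemma 4": 126.0, "GLM-5": 106.9},
}

# Total words across the suite; the page reports these for two models only.
WORDS = {"Qwen 3.5": 8574, "GLM-5": 7490}


def total_time(times: dict, model: str) -> float:
    """Sum one model's per-test seconds, rounded to 0.1s."""
    return round(sum(row[model] for row in times.values()), 1)


def words_per_second(words: int, seconds: float) -> float:
    """Average output rate over the whole suite."""
    return round(words / seconds, 1)


for model in ("Qwen 3.5", "Gemma 4", "GLM-5"):
    secs = total_time(TIMES, model)
    if model in WORDS:
        rate = words_per_second(WORDS[model], secs)
        print(f"{model}: {secs}s total, {rate} words/s")
    else:
        print(f"{model}: {secs}s total, word count not reported")
```

The rows sum to 324.0s for Qwen 3.5, 535.3s for Gemma 4, and 1005.9s for GLM-5, which works out to about 26.5 words/s for Qwen 3.5 against about 7.4 words/s for GLM-5. Words per second is a crude proxy, since it says nothing about how correct or useful those words are.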

✅ Verdict for Modernisation

Keep Qwen 3.5 as the primary model: it wins 4 of 6 tests, runs faster, and produces comparable or greater output depth. GLM-5 stands out in only two places: OpenAPI spec completeness and catching subtle logic gaps in legacy code. For production workflows, Qwen 3.5 is the clear choice.