Overview
We tested 7 different model families across 28 experiments to understand how model architecture, size, and specialization impact code evolution performance.

Model Specifications
| Model | Type | Parameters | Specialization | Best Config |
|---|---|---|---|---|
| Gemini Flash 2.5 | Efficient | ~70B | Coding-optimized | 200 iter, diff, temp 0.4 |
| Gemini Flash 2.5 Lite | Small | ~10B | General | 100 iter, full, temp 0.4 |
| Qwen3-Coder | MoE | 480B (35B active) | Coding-focused | 100 iter, diff, temp 0.6 |
| Qwen3-235B | Dense | 235B | General-purpose | 100 iter, full |
| Qwen3-32B | Medium | 32B | General | 100 iter, optimized |
| Gemma 3 27B | Open | 27B | General | 100 iter, diff |
| Llama 3.3 70B | Open | 70B | General | 100 iter, diff |
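The "Best Config" column summarizes the evolution settings used for each run: iteration budget, mutation strategy (diff-based edits vs. full rewrites), and sampling temperature. As a rough illustration of how such a run might be parameterized (the field names below are assumptions for readability, not the harness's actual configuration schema):

```python
# Hypothetical per-run settings; the field names are illustrative assumptions,
# not the actual configuration schema of the evolution harness.
best_configs = {
    "gemini-flash-2.5": {
        "max_iterations": 200,
        "strategy": "diff",          # propose targeted edits to the current program
        "temperature": 0.4,
    },
    "gemini-flash-2.5-lite": {
        "max_iterations": 100,
        "strategy": "full_rewrite",  # regenerate the whole program each step
        "temperature": 0.4,
    },
    "qwen3-coder": {
        "max_iterations": 100,
        "strategy": "diff",
        "temperature": 0.6,
    },
}
```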
Performance Rankings
Overall AlgoTune Scores
- Gemini Flash 2.5 (200 iter): 2.039x
- Gemini Flash 2.5 (100 iter): 1.637x
- Gemma 3 27B: 1.630x
- Qwen3-Coder 480B: 1.414x
- Qwen3-32B: 1.306x
- Gemini Flash Lite: 1.291x (best config)
- Qwen3-235B: 0.836x
Key Finding: Specialization > Size
Case Study: Qwen3-Coder vs Qwen3-235B
Despite Qwen3-235B activating all 235B parameters per token versus Qwen3-Coder's 35B active parameters, the coding-specialized model comes out clearly ahead:
Qwen3-Coder (480B MoE, 35B active):
- AlgoTune Score: 1.414x
- Best task: psd_cone_projection (41.9x)
- Strengths: Understanding code patterns, generating valid diffs
Qwen3-235B (235B dense):
- AlgoTune Score: 0.836x (WORSE than baseline!)
- Best task: count_connected_components (34.5x)
- Weaknesses: Poor code generation, invalid syntax
Why Specialization Wins:
- Training data: Qwen3-Coder trained on code repositories
- Architecture: MoE allows specialized experts for different code patterns
- Objective: Optimized for code understanding and generation
Evolution Strategy Analysis
Diff-Based vs Full Rewrite Performance
| Model | Diff-Based | Full Rewrite | Best Strategy |
|---|---|---|---|
| Gemini Flash 2.5 | 1.637x ✓ | ~1.4x | Diff |
| Gemini Flash Lite | 0.786x ✗ | 1.096x ✓ | Full |
| Qwen3-Coder | 1.414x ✓ | Lower | Diff |
| Gemma 3 | 1.630x ✓ | N/A | Diff |
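To make the strategy distinction concrete, the sketch below shows one way a diff-based mutation could be applied to the current program, assuming a simple SEARCH/REPLACE block format; the format and the apply_diff helper are illustrative assumptions, not the system's actual implementation. Full-rewrite mode instead discards the current program and replaces it wholesale with the model's new output.

```python
# Minimal sketch of applying a SEARCH/REPLACE-style diff to a program.
# The "<<<<<<< SEARCH / ======= / >>>>>>> REPLACE" markers are an assumption
# for illustration, not necessarily the exact format the harness uses.
def apply_diff(source: str, diff: str) -> str:
    """Apply each SEARCH/REPLACE block in `diff` to `source`."""
    blocks = diff.split("<<<<<<< SEARCH\n")[1:]
    for block in blocks:
        search, rest = block.split("\n=======\n", 1)
        replace = rest.split("\n>>>>>>> REPLACE", 1)[0]
        if search not in source:
            # Invalid edit: the model referenced code that is not in the program.
            continue
        source = source.replace(search, replace, 1)
    return source
```

A model that generates valid diffs keeps most of the program intact each iteration; a model that cannot anchor its edits (as Flash Lite demonstrates below) does better regenerating the whole file.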
Example: Why Flash Lite Fails with Diffs
Attempted Diff (iteration 34):
```diff
- for i in range(n):
-     if not visited[i]:
-         dfs(i)
-         count += 1
+ for i in range(n):
+     if not visited[i]:
+         # TODO: optimize this
+         dfs(i)
+         count += 1
```
No actual optimization; the model only added a comment.
Successful Full Rewrite:
```python
# Complete reimplementation
def solve(problem):
    # New BFS approach
    from collections import deque
    # ... full new implementation
```
Model-Specific Strengths
Gemini Flash 2.5
Strengths:
- Excellent at algorithmic optimization
- Finds creative solutions
- Strong diff generation
Example Achievement:
- count_connected_components: 95.78x speedup
- Discovered BFS optimization in iteration 2
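A BFS-based connected-components count along these lines, using collections.deque, is sketched below. It is a hedged reconstruction of the kind of solution the evolved program converged on; the input schema (num_nodes plus an adjacency list adj) is an assumption for illustration, not the exact evolved code.

```python
from collections import deque

def count_connected_components(problem):
    # Assumes the task provides a node count and an adjacency list;
    # the exact input schema is an assumption for illustration.
    n = problem["num_nodes"]
    adj = problem["adj"]
    visited = [False] * n
    components = 0
    for start in range(n):
        if visited[start]:
            continue
        components += 1
        queue = deque([start])
        visited[start] = True
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if not visited[neighbor]:
                    visited[neighbor] = True
                    queue.append(neighbor)
    return components
```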
Qwen3-Coder
Strengths:
- Deep understanding of data structures
- Excellent at API optimization
- Consistent improvements
Example Achievement:
- Discovered Union-Find algorithm for graph problems (see the sketch below)
- Consistent numpy optimizations
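For comparison with the BFS approach above, a minimal Union-Find (disjoint-set) component counter of the sort Qwen3-Coder converged on is sketched below; the edge-list input format is an assumption for illustration, not the exact code the model produced.

```python
def count_components_union_find(num_nodes, edges):
    # Disjoint-set forest with path compression and union by size.
    parent = list(range(num_nodes))
    size = [1] * num_nodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = num_nodes
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            continue
        if size[ru] < size[rv]:
            ru, rv = rv, ru
        parent[rv] = ru        # merge the smaller tree into the larger one
        size[ru] += size[rv]
        components -= 1        # each successful union removes one component
    return components
```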
Gemma 3
Strengths:
- Balanced performance
- Good generalization
- Efficient for its size
Surprise Result: Nearly matched larger models (1.63x)
Task Performance Patterns
Tasks Where Models Converge
DCT/FFT Operations:
- Most models found the `dtype=np.float64` optimization
- Similar 6-7x speedups across models
- Limited optimization space
Example (Multiple models):
```python
# All discovered this pattern
signal = np.array(problem["signal"], dtype=np.float64)
```
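In context, the cast typically sits at the top of the solver so the transform runs entirely in float64. A minimal sketch follows, assuming the task exposes the input under problem["signal"] and that SciPy's fft.dct is the transform being timed; both assumptions are for illustration only.

```python
import numpy as np
from scipy import fft

def solve(problem):
    # Cast once up front so the transform does not pay for dtype conversion;
    # the ~6-7x speedups above came from variations of this pattern.
    signal = np.array(problem["signal"], dtype=np.float64)
    return fft.dct(signal, type=2)
```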
Tasks Where Models Diverge
Count Connected Components:
- Gemini: BFS with deque (95.78x)
- Qwen3-Coder: Union-Find (25x)
- Flash Lite: Minor improvements (3x)
Different algorithmic approaches!
Computational Efficiency
| Model | Avg Time/Task | Performance/Time Ratio |
|---|---|---|
| Gemini Flash 2.5 | 165s | High |
| Flash Lite | 107s | Medium |
| Qwen3-Coder | 590s | Medium |
| Qwen3-235B | 336s | Low |
| Gemma 3 | 433s | High |
Convergence Patterns
Fast Learners
Gemini Flash 2.5: Found optimal solutions quickly
- Iteration 2: Major improvement
- Iteration 50: Near final performance
- Iteration 100+: Refinements
Steady Improvers
Qwen3-Coder: Gradual optimization
- Iteration 25: Initial improvements
- Iteration 50: Significant gains
- Iteration 99: Still finding optimizations
Unstable Evolution
Qwen3-235B: Erratic performance
- Multiple regressions
- Failed to maintain improvements
- General-purpose training shows its limits here
Ensemble Performance
Gemini + Qwen Ensemble:
- AlgoTune Score: 1.226x
- Falls between the individual models' scores
- Not simply an average; some synergy was observed
- Future opportunity: Model-to-island assignment
Model Performance by Use Case
Production Optimization Results
Gemini Flash 2.5 Performance:
- Achieved 2.04x speedup with 200 iterations
- 1.64x speedup with 100 iterations
- Fastest inference among top performers
Open Model Performance
Gemma 3 27B Results:
- Achieved 1.63x speedup (matching Flash 2.5 at 100 iter)
- Surprised with strong diff-based evolution capability
- Consistent performance across tasks
Specialized Code Model Behavior
Qwen3-Coder 480B Observations:
- 1.41x overall speedup
- Found unique optimizations (Union-Find vs BFS)
- Strong performance on specific algorithmic tasks
Lightweight Model Testing
Flash Lite Results:
- 1.29x best performance with optimal parameters
- Required full rewrite strategy (0.79x with diffs)
- Suitable for parameter experimentation
Conclusion
Model selection dramatically impacts evolution success. Specialized coding models outperform larger general models, and evolution strategy must match model capabilities. For best results, use Gemini Flash 2.5 with diff-based evolution and 200 iterations, achieving over 2x average speedup with individual tasks reaching nearly 100x improvement.