Model Comparison Analysis

Overview

We tested 7 different model families across 28 experiments to understand how model architecture, size, and specialization impact code evolution performance.

Model Performance Heatmap

Model Specifications

| Model | Type | Parameters | Specialization | Best Config |
|---|---|---|---|---|
| Gemini Flash 2.5 | Efficient | ~70B | Coding-optimized | 200 iter, diff, temp 0.4 |
| Gemini Flash 2.5 Lite | Small | ~10B | General | 100 iter, full, temp 0.4 |
| Qwen3-Coder | MoE | 480B (35B active) | Coding-focused | 100 iter, diff, temp 0.6 |
| Qwen3-235B | Dense | 235B | General-purpose | 100 iter, full |
| Qwen3-32B | Medium | 32B | General | 100 iter, optimized |
| Gemma 3 27B | Open | 27B | General | 100 iter, diff |
| Llama 3.3 70B | Open | 70B | General | 100 iter, diff |

Performance Rankings

Overall AlgoTune Scores

  1. Gemini Flash 2.5 (200 iter): 2.039x
  2. Gemini Flash 2.5 (100 iter): 1.637x
  3. Gemma 3 27B: 1.630x
  4. Qwen3-Coder 480B: 1.414x
  5. Qwen3-32B: 1.306x
  6. Gemini Flash Lite: 1.291x (best config)
  7. Qwen3-235B: 0.836x

Key Finding: Specialization > Size

Case Study: Qwen3-Coder vs Qwen3-235B

Qwen3-235B activates all 235B of its parameters per token, while Qwen3-Coder activates only 35B of its 480B. The specialized model still wins decisively:

Qwen3-Coder (480B MoE, 35B active):

  • AlgoTune Score: 1.414x
  • Best task: psd_cone_projection (41.9x)
  • Strengths: Understanding code patterns, generating valid diffs

Qwen3-235B (235B dense):

  • AlgoTune Score: 0.836x (worse than the 1.0x unoptimized baseline!)
  • Best task: count_connected_components (34.5x)
  • Weaknesses: Poor code generation, invalid syntax

Why Specialization Wins:

  1. Training data: Qwen3-Coder trained on code repositories
  2. Architecture: MoE allows specialized experts for different code patterns
  3. Objective: Optimized for code understanding and generation

Evolution Strategy Analysis

Diff-Based vs Full Rewrite Performance

| Model | Diff-Based | Full Rewrite | Best Strategy |
|---|---|---|---|
| Gemini Flash 2.5 | 1.637x ✓ | ~1.4x | Diff |
| Gemini Flash Lite | 0.786x ✗ | 1.096x ✓ | Full |
| Qwen3-Coder | 1.414x ✓ | Lower | Diff |
| Gemma 3 | 1.630x ✓ | N/A | Diff |

Example: Why Flash Lite Fails with Diffs

Attempted Diff (iteration 34):

```diff
- for i in range(n):
-     if not visited[i]:
-         dfs(i)
-         count += 1
+ for i in range(n):
+     if not visited[i]:
+         # TODO: optimize this
+         dfs(i)
+         count += 1
```

The diff adds only a comment; there is no actual optimization.

Successful Full Rewrite:

```python
# Complete reimplementation
def solve(problem):
    # New BFS approach
    from collections import deque
    # ... full new implementation
```
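The rewrite above is abbreviated; a runnable sketch of such a BFS reimplementation might look like the following. The input schema (`num_nodes`, `edges`) is a hypothetical illustration, not the benchmark's actual format.

```python
from collections import deque

def solve(problem):
    # Hypothetical input schema: {"num_nodes": int, "edges": [(u, v), ...]}
    n = problem["num_nodes"]
    adj = [[] for _ in range(n)]
    for u, v in problem["edges"]:
        adj[u].append(v)
        adj[v].append(u)

    visited = [False] * n
    count = 0
    for start in range(n):
        if visited[start]:
            continue
        count += 1  # each BFS sweep covers exactly one component
        visited[start] = True
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if not visited[nb]:
                    visited[nb] = True
                    queue.append(nb)
    return count
```

An iterative BFS with `deque` also sidesteps Python's recursion limit, which the recursive `dfs` in the failed diff example would hit on large components.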

Model-Specific Strengths

Gemini Flash 2.5

Strengths:

  • Excellent at algorithmic optimization
  • Finds creative solutions
  • Strong diff generation

Example Achievement:

  • count_connected_components: 95.78x speedup
  • Discovered BFS optimization in iteration 2

Qwen3-Coder

Strengths:

  • Deep understanding of data structures
  • Excellent at API optimization
  • Consistent improvements

Example Achievement:

  • Discovered Union-Find algorithm for graph problems
  • Consistent numpy optimizations
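The Union-Find approach that Qwen3-Coder converged on for graph problems can be sketched as follows; as before, the input schema is a hypothetical illustration.

```python
def solve(problem):
    # Hypothetical input schema: {"num_nodes": int, "edges": [(u, v), ...]}
    n = problem["num_nodes"]
    parent = list(range(n))

    def find(x):
        # Path halving keeps trees shallow without recursion.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    count = n
    for u, v in problem["edges"]:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            count -= 1  # each successful union merges two components
    return count
```

Union-Find processes each edge in near-constant amortized time, which explains why it is competitive with BFS even without building adjacency lists.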

Gemma 3

Strengths:

  • Balanced performance
  • Good generalization
  • Efficient for its size

Surprise Result: Nearly matched larger models (1.63x)

Task Performance Patterns

Tasks Where Models Converge

DCT/FFT Operations:

  • Most models found dtype=np.float64 optimization
  • Similar 6-7x speedups across models
  • Limited optimization space

Example (Multiple models):

```python
# All discovered this pattern
signal = np.array(problem["signal"], dtype=np.float64)
```
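In context, the pattern looks like this; the surrounding `solve` body and the use of `numpy.fft` are illustrative, assuming the task hands the signal in as a plain list.

```python
import numpy as np

def solve(problem):
    # Cast the input list to float64 once, so the transform runs on a
    # contiguous native-precision array instead of casting element by element.
    signal = np.array(problem["signal"], dtype=np.float64)
    return np.fft.rfft(signal)
```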

Tasks Where Models Diverge

Count Connected Components:

  • Gemini: BFS with deque (95.78x)
  • Qwen3-Coder: Union-Find (25x)
  • Flash Lite: Minor improvements (3x)

Different algorithmic approaches!

Computational Efficiency

| Model | Avg Time/Task | Performance/Time Ratio |
|---|---|---|
| Gemini Flash 2.5 | 165s | High |
| Flash Lite | 107s | Medium |
| Qwen3-Coder | 590s | Medium |
| Qwen3-235B | 336s | Low |
| Gemma 3 | 433s | High |

Convergence Patterns

Fast Learners

Gemini Flash 2.5: Found optimal solutions quickly

  • Iteration 2: Major improvement
  • Iteration 50: Near final performance
  • Iteration 100+: Refinements

Steady Improvers

Qwen3-Coder: Gradual optimization

  • Iteration 25: Initial improvements
  • Iteration 50: Significant gains
  • Iteration 99: Still finding optimizations

Unstable Evolution

Qwen3-235B: Erratic performance

  • Multiple regressions
  • Failed to maintain improvements
  • General-purpose training shows its limits here

Ensemble Performance

Gemini + Qwen Ensemble:

  • AlgoTune Score: 1.226x
  • Between individual model performances
  • Not a simple average; some synergy observed
  • Future opportunity: Model-to-island assignment
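One way such an assignment could work is a fixed round-robin mapping of models to islands. This is a hypothetical sketch, not the framework's actual API; all names here are illustrative.

```python
import itertools

def assign_models_to_islands(models, num_islands):
    # Round-robin: each island evolves with one fixed model, so different
    # model "styles" explore separate populations before migration mixes them.
    cycle = itertools.cycle(models)
    return {island: next(cycle) for island in range(num_islands)}
```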

Model Performance by Use Case

Production Optimization Results

Gemini Flash 2.5 Performance:

  • Achieved 2.04x speedup with 200 iterations
  • 1.64x speedup with 100 iterations
  • Fastest inference among top performers

Open Model Performance

Gemma 3 27B Results:

  • Achieved 1.63x speedup (nearly matching Flash 2.5's 1.64x at 100 iterations)
  • Surprised with strong diff-based evolution capability
  • Consistent performance across tasks

Specialized Code Model Behavior

Qwen3-Coder 480B Observations:

  • 1.41x overall speedup
  • Found unique optimizations (Union-Find vs BFS)
  • Strong performance on specific algorithmic tasks

Lightweight Model Testing

Flash Lite Results:

  • 1.29x best performance with optimal parameters
  • Required full rewrite strategy (0.79x with diffs)
  • Suitable for parameter experimentation

Conclusion

Model selection dramatically impacts evolution success. Specialized coding models outperform larger general models, and evolution strategy must match model capabilities. For best results, use Gemini Flash 2.5 with diff-based evolution and 200 iterations, achieving over 2x average speedup with individual tasks reaching nearly 100x improvement.
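As a concrete starting point, the recommended settings could be captured in a configuration like the following. The field names are a hypothetical sketch, not the framework's actual schema; only the values come from the results above.

```python
# Hypothetical configuration sketch; field names are illustrative only.
RECOMMENDED_CONFIG = {
    "model": "gemini-flash-2.5",
    "evolution_strategy": "diff",  # diff-based edits beat full rewrites here
    "iterations": 200,             # 2.04x avg speedup vs 1.64x at 100
    "temperature": 0.4,            # best-performing setting in these runs
}
```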
