Model Comparison Analysis

Overview

We tested 7 different model families across 28 experiments to understand how model architecture, size, and specialization impact code evolution performance.

Model Performance Heatmap

Model Specifications

| Model | Type | Parameters | Specialization | Best Config |
|---|---|---|---|---|
| Gemini Flash 2.5 | Efficient | ~70B | Coding-optimized | 200 iter, diff, temp 0.4 |
| Gemini Flash 2.5 Lite | Small | ~10B | General | 100 iter, full, temp 0.4 |
| Qwen3-Coder | MoE | 480B (35B active) | Coding-focused | 100 iter, diff, temp 0.6 |
| Qwen3-235B | Dense | 235B | General-purpose | 100 iter, full |
| Qwen3-32B | Medium | 32B | General | 100 iter, optimized |
| Gemma 3 27B | Open | 27B | General | 100 iter, diff |
| Llama 3.3 70B | Open | 70B | General | 100 iter, diff |

Performance Rankings

Overall AlgoTune Scores

  1. Gemini Flash 2.5 (200 iter): 2.039x
  2. Gemini Flash 2.5 (100 iter): 1.637x
  3. Gemma 3 27B: 1.630x
  4. Qwen3-Coder 480B: 1.414x
  5. Qwen3-32B: 1.306x
  6. Gemini Flash Lite: 1.291x (best config)
  7. Qwen3-235B: 0.836x

Key Finding: Specialization > Size

Case Study: Qwen3-Coder vs Qwen3-235B

Qwen3-235B activates all 235B of its parameters per token, while Qwen3-Coder activates only 35B of its 480B. The specialized model still wins decisively:

Qwen3-Coder (480B MoE, 35B active):

  • AlgoTune Score: 1.414x
  • Best task: psd_cone_projection (41.9x)
  • Strengths: Understanding code patterns, generating valid diffs

Qwen3-235B (235B dense):

  • AlgoTune Score: 0.836x (worse than the 1.0x unoptimized baseline!)
  • Best task: count_connected_components (34.5x)
  • Weaknesses: Poor code generation, invalid syntax

Why Specialization Wins:

  1. Training data: Qwen3-Coder trained on code repositories
  2. Architecture: MoE allows specialized experts for different code patterns
  3. Objective: Optimized for code understanding and generation

Evolution Strategy Analysis

Diff-Based vs Full Rewrite Performance

| Model | Diff-Based | Full Rewrite | Best Strategy |
|---|---|---|---|
| Gemini Flash 2.5 | 1.637x ✓ | ~1.4x | Diff |
| Gemini Flash Lite | 0.786x ✗ | 1.096x ✓ | Full |
| Qwen3-Coder | 1.414x ✓ | Lower | Diff |
| Gemma 3 | 1.630x ✓ | N/A | Diff |

Example: Why Flash Lite Fails with Diffs

Attempted Diff (iteration 34):

```diff
- for i in range(n):
-     if not visited[i]:
-         dfs(i)
-         count += 1
+ for i in range(n):
+     if not visited[i]:
+         # TODO: optimize this
+         dfs(i)
+         count += 1
```

The diff adds only a comment; there is no actual optimization.

Successful Full Rewrite:

```python
# Complete reimplementation
def solve(problem):
    # New BFS approach
    from collections import deque
    # ... full new implementation
```
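The rewrite above is abbreviated; a runnable sketch of such a BFS reimplementation might look like the following. The input schema (`num_nodes`, `edges`) is a hypothetical illustration, not the benchmark's actual format.

```python
from collections import deque

def solve(problem):
    # Hypothetical input schema: {"num_nodes": int, "edges": [(u, v), ...]}
    n = problem["num_nodes"]
    adj = [[] for _ in range(n)]
    for u, v in problem["edges"]:
        adj[u].append(v)
        adj[v].append(u)

    visited = [False] * n
    count = 0
    for start in range(n):
        if visited[start]:
            continue
        count += 1  # each BFS sweep covers exactly one component
        visited[start] = True
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if not visited[nb]:
                    visited[nb] = True
                    queue.append(nb)
    return count
```

An iterative BFS with `deque` also sidesteps Python's recursion limit, which the recursive `dfs` in the failed diff example would hit on large components.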

Model-Specific Strengths

Gemini Flash 2.5

Strengths:

  • Excellent at algorithmic optimization
  • Finds creative solutions
  • Strong diff generation

Example Achievement:

  • count_connected_components: 95.78x speedup
  • Discovered BFS optimization in iteration 2

Qwen3-Coder

Strengths:

  • Deep understanding of data structures
  • Excellent at API optimization
  • Consistent improvements

Example Achievement:

  • Discovered Union-Find algorithm for graph problems
  • Consistent numpy optimizations
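The Union-Find approach that Qwen3-Coder converged on for graph problems can be sketched as follows; as before, the input schema is a hypothetical illustration.

```python
def solve(problem):
    # Hypothetical input schema: {"num_nodes": int, "edges": [(u, v), ...]}
    n = problem["num_nodes"]
    parent = list(range(n))

    def find(x):
        # Path halving keeps trees shallow without recursion.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    count = n
    for u, v in problem["edges"]:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            count -= 1  # each successful union merges two components
    return count
```

Union-Find processes each edge in near-constant amortized time, which explains why it is competitive with BFS even without building adjacency lists.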

Gemma 3

Strengths:

  • Balanced performance
  • Good generalization
  • Efficient for its size

Surprise Result: Nearly matched larger models (1.63x)

Task Performance Patterns

Tasks Where Models Converge

DCT/FFT Operations:

  • Most models found dtype=np.float64 optimization
  • Similar 6-7x speedups across models
  • Limited optimization space

Example (Multiple models):

```python
# All discovered this pattern
signal = np.array(problem["signal"], dtype=np.float64)
```
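In context, the pattern looks like this; the surrounding `solve` body and the use of `numpy.fft` are illustrative, assuming the task hands the signal in as a plain list.

```python
import numpy as np

def solve(problem):
    # Cast the input list to float64 once, so the transform runs on a
    # contiguous native-precision array instead of casting element by element.
    signal = np.array(problem["signal"], dtype=np.float64)
    return np.fft.rfft(signal)
```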

Tasks Where Models Diverge

Count Connected Components:

  • Gemini: BFS with deque (95.78x)
  • Qwen3-Coder: Union-Find (25x)
  • Flash Lite: Minor improvements (3x)

Different algorithmic approaches!

Computational Efficiency

| Model | Avg Time/Task | Performance/Time Ratio |
|---|---|---|
| Gemini Flash 2.5 | 165s | High |
| Flash Lite | 107s | Medium |
| Qwen3-Coder | 590s | Medium |
| Qwen3-235B | 336s | Low |
| Gemma 3 | 433s | High |

Convergence Patterns

Fast Learners

Gemini Flash 2.5: Found optimal solutions quickly

  • Iteration 2: Major improvement
  • Iteration 50: Near final performance
  • Iteration 100+: Refinements

Steady Improvers

Qwen3-Coder: Gradual optimization

  • Iteration 25: Initial improvements
  • Iteration 50: Significant gains
  • Iteration 99: Still finding optimizations

Unstable Evolution

Qwen3-235B: Erratic performance

  • Multiple regressions
  • Failed to maintain improvements
  • General-purpose training shows its limits here

Ensemble Performance

Gemini + Qwen Ensemble:

  • AlgoTune Score: 1.226x
  • Between individual model performances
  • Not a simple average; some synergy observed
  • Future opportunity: Model-to-island assignment
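One way such an assignment could work is a fixed round-robin mapping of models to islands. This is a hypothetical sketch, not the framework's actual API; all names here are illustrative.

```python
import itertools

def assign_models_to_islands(models, num_islands):
    # Round-robin: each island evolves with one fixed model, so different
    # model "styles" explore separate populations before migration mixes them.
    cycle = itertools.cycle(models)
    return {island: next(cycle) for island in range(num_islands)}
```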

Model Performance by Use Case

Production Optimization Results

Gemini Flash 2.5 Performance:

  • Achieved 2.04x speedup with 200 iterations
  • 1.64x speedup with 100 iterations
  • Fastest inference among top performers

Open Model Performance

Gemma 3 27B Results:

  • Achieved 1.63x speedup (nearly matching Flash 2.5's 1.64x at 100 iterations)
  • Surprised with strong diff-based evolution capability
  • Consistent performance across tasks

Specialized Code Model Behavior

Qwen3-Coder 480B Observations:

  • 1.41x overall speedup
  • Found unique optimizations (Union-Find vs BFS)
  • Strong performance on specific algorithmic tasks

Lightweight Model Testing

Flash Lite Results:

  • 1.29x best performance with optimal parameters
  • Required full rewrite strategy (0.79x with diffs)
  • Suitable for parameter experimentation

Conclusion

Model selection dramatically impacts evolution success. Specialized coding models outperform larger general models, and evolution strategy must match model capabilities. For best results, use Gemini Flash 2.5 with diff-based evolution and 200 iterations, achieving over 2x average speedup with individual tasks reaching nearly 100x improvement.
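As a concrete starting point, the recommended settings could be captured in a configuration like the following. The field names are a hypothetical sketch, not the framework's actual schema; only the values come from the results above.

```python
# Hypothetical configuration sketch; field names are illustrative only.
RECOMMENDED_CONFIG = {
    "model": "gemini-flash-2.5",
    "evolution_strategy": "diff",  # diff-based edits beat full rewrites here
    "iterations": 200,             # 2.04x avg speedup vs 1.64x at 100
    "temperature": 0.4,            # best-performing setting in these runs
}
```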
