Evolution Strategies Analysis

Overview

We tested two fundamental evolution strategies: diff-based (incremental changes) and full rewrite (complete regeneration). We also found that parallel evaluation is not just an optimization but a critical requirement.

Evolution Strategy Comparison

Diff-Based Evolution

Approach: Generate incremental changes to existing code

- old_code
+ new_code
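
As a rough illustration, an incremental change in this format can be applied as a search-and-replace over the current program text. The helper below is only a sketch of that idea, not the exact mechanism used in our runs:

def apply_diff(program: str, old_block: str, new_block: str) -> str:
    """Replace one exact occurrence of old_block with new_block."""
    if old_block not in program:
        raise ValueError("Old block not found; the diff does not apply")
    return program.replace(old_block, new_block, 1)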

Advantages:

  • Preserves working logic
  • Gradual refinement
  • Easier to track changes
  • Lower chance of breaking

Disadvantages:

  • Requires understanding of diffs
  • Can get stuck in local optima
  • Harder for weak models

Full Rewrite Evolution

Approach: Generate complete new implementation

def solve(problem):
    # Entirely new implementation, regenerated from scratch each iteration
    ...

Advantages:

  • Can make radical improvements
  • Escape local optima
  • Simpler for models
  • Fresh perspective each time

Disadvantages:

  • May lose working optimizations
  • Higher chance of syntax errors
  • More compute intensive

Model-Specific Results

Strong Coding Models

Gemini Flash 2.5

  • Full rewrite: ~1.4x (estimated from Flash Lite scaling)
  • Diff-based: 1.637x ✓
  • Improvement: +17% with diffs

Example Successful Diff:

# Iteration 45 - PSD Projection optimization
- eigenvalues[eigenvalues < 0] = 0
- A_psd = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
+ # Vectorized operation eliminates intermediate array
+ A_psd = (eigenvectors * np.maximum(eigenvalues, 0)) @ eigenvectors.T

Clear understanding of the optimization opportunity!
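
The two forms are mathematically equivalent; a quick numerical check makes that easy to confirm (an illustrative snippet, not part of the evolved program):

import numpy as np

A = np.random.randn(100, 100)
A = (A + A.T) / 2                                   # random symmetric matrix
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Original: clip negatives, rebuild via an explicit diagonal matrix
clipped = eigenvalues.copy()
clipped[clipped < 0] = 0
A_psd_old = eigenvectors @ np.diag(clipped) @ eigenvectors.T

# Evolved: scale eigenvector columns directly, skipping the intermediate matrix
A_psd_new = (eigenvectors * np.maximum(eigenvalues, 0)) @ eigenvectors.T

assert np.allclose(A_psd_old, A_psd_new)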

Qwen3-Coder (480B MoE)

  • Full rewrite: 1.093x
  • Diff-based: 1.414x ✓
  • Improvement: +29% with diffs

Example Successful Diff:

# Iteration 67 - Graph algorithm optimization
- def dfs(node):
-     visited[node] = True
-     for neighbor in adj[node]:
-         if not visited[neighbor]:
-             dfs(neighbor)
+ # Switch to iterative approach
+ stack = [node]
+ while stack:
+     curr = stack.pop()
+     if not visited[curr]:
+         visited[curr] = True
+         stack.extend(adj[curr])

An algorithmic improvement with correct diff syntax; the iterative version also avoids Python's recursion-depth limit on large graphs.

Weak/Small Models

Gemini Flash 2.5 Lite

  • Full rewrite: 1.10x ✓
  • Diff-based: 0.793x ✗
  • Degradation: -28% with diffs!

Example Failed Diff Attempts:

# Iteration 23 - Attempted optimization
- result = np.dot(A, B)
+ result = np.dot(A, B)  # TODO: optimize this
# Iteration 45 - Confused diff
- for i in range(n):
+ for i in range(n):  # This loop could be faster
+     # Maybe use numpy?

The model adds comments instead of actual optimizations!

Serial vs Parallel Evolution

The Catastrophic Failure of Serial Evaluation

We tested running tasks sequentially vs in parallel:

Parallel Evaluation (Standard)

How it works:

  • All 30 tasks evolve simultaneously
  • Cross-pollination between tasks
  • Shared discoveries
  • Load balanced across workers

Results:

  • Flash Lite (diff): 0.793x in 0.9 hours
  • Flash Lite (full): 1.10x in 0.9 hours
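
The setup above can be sketched with a standard process pool. Here evaluate_task, tasks, and the worker count are illustrative placeholders rather than the actual pipeline, which also shares discoveries through a common program database:

from concurrent.futures import ProcessPoolExecutor, as_completed

def evaluate_task(task):
    # Placeholder: run one evolution/evaluation cycle for a single task
    ...

def evaluate_all(tasks, max_workers=8):
    results = {}
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(evaluate_task, t): t for t in tasks}
        for future in as_completed(futures):
            results[futures[future]] = future.result()   # collect each task as it finishes
    return results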

Serial Evaluation

How it works:

  • Tasks evolve one at a time
  • No cross-learning
  • Sequential processing
  • Single worker thread

Results:

  • Flash Lite (diff): 0.396x in 13.0 hours (50% worse!)
  • Flash Lite (full): 0.585x in 13.1 hours (47% worse!)

Why Serial Fails So Badly

  1. No Cross-Task Learning

    • Optimization discovered for task A can't help task B
    • Each task starts from scratch
    • Lost synergies
  2. Compound Timeout Issues

    Task 1: 26 minutes (some timeouts)
    Task 2: 28 minutes (more timeouts)
    ...
    Task 30: 35 minutes (cascading delays)
    Total: 13+ hours!
    
  3. Evolution Gets Stuck

    • Hard tasks block progress
    • No parallel exploration
    • Single failure point
  4. Resource Inefficiency

    • CPU mostly idle
    • No pipeline benefits
    • Wasted evaluation cycles

Parallel Discovery Example

Count Connected Components optimization discovered in parallel:

  • Task A: Discovers BFS is faster than DFS
  • Task B: Simultaneously discovers deque optimization
  • Task C: Finds early termination strategy
  • Combined: All three insights merge → 95x speedup!

In serial: Each discovery happens in isolation, rarely combining.
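
A hedged sketch of what the merged program might look like once those insights combine (illustrative code, not the actual evolved solution; adj is assumed to be an adjacency list):

from collections import deque

def count_components(adj):
    """Count connected components of an undirected graph given as adjacency lists."""
    n = len(adj)
    visited = [False] * n
    remaining = n                      # early-termination insight: stop once every node is assigned
    components = 0
    for start in range(n):
        if visited[start]:
            continue
        components += 1
        visited[start] = True
        remaining -= 1
        queue = deque([start])         # BFS with a deque instead of recursive DFS
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if not visited[nb]:
                    visited[nb] = True
                    remaining -= 1
                    queue.append(nb)
        if remaining == 0:
            break                      # nothing left unvisited; skip the remaining start nodes
    return components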

Strategy Selection Guidelines

Decision Tree

Is the model coding-specialized?
├─ Yes: Is it large/capable (>30B)?
│   ├─ Yes: Use diff-based evolution
│   └─ No: Test both, likely full rewrite
└─ No: Use full rewrite

Always use parallel evaluation!
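
The same decision logic as a small helper (the 30B threshold and return labels are illustrative, not a tuned heuristic):

def choose_strategy(coding_specialized: bool, params_billions: float) -> str:
    """Encode the decision tree above for picking an evolution strategy."""
    if not coding_specialized:
        return "full_rewrite"
    if params_billions > 30:
        return "diff_based"
    return "test_both"   # in practice, full rewrite usually wins here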

Model Capability Indicators

Signs model can handle diffs:

  • Generates syntactically correct code consistently
  • Shows understanding of code structure
  • Makes meaningful incremental changes
  • Preserves function signatures

Signs model needs full rewrites:

  • Generates broken diffs
  • Adds comments instead of code
  • Loses track of changes
  • Inconsistent formatting

Best Practices

For Diff-Based Evolution

  1. Clear Prompts: Show exact diff format
  2. Good Examples: Include successful diffs
  3. Incremental: Encourage small changes
  4. Validation: Check diff applies cleanly
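
For the validation step, a cheap check is to require the search block to occur exactly once before applying the replacement (a sketch under that assumption, not the exact validator in our pipeline):

def diff_applies_cleanly(program: str, old_block: str) -> bool:
    """A diff applies cleanly only if its search block matches exactly one location."""
    return program.count(old_block) == 1

# Reject the candidate (or fall back to a full rewrite) when this returns False.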

For Full Rewrite Evolution

  1. Preserve Interface: Keep the function signature (see the check sketched after this list)
  2. Maintain Correctness: Emphasize working code
  3. Fresh Perspective: Don't show too much context
  4. Radical Changes: Encourage new approaches
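
Point 1 can be enforced mechanically before a rewrite is ever executed, for example with a lightweight AST check (a sketch assuming the entry point is solve(problem); the real harness may differ):

import ast

def preserves_interface(source: str, name: str = "solve", n_args: int = 1) -> bool:
    """Check that a rewritten program still defines the expected entry point with the expected arity."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False                   # broken rewrites are rejected outright
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            return len(node.args.args) == n_args
    return False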

For Parallel Evaluation

  1. Sufficient Workers: At least 4-8 parallel
  2. Balanced Load: Distribute tasks evenly
  3. Shared Database: Enable cross-learning
  4. Migration: Allow program sharing

Evolution Patterns Observed

Successful Evolution Trajectory (Diff-Based)

Iteration 1-20: Syntax cleanup, minor optimizations
Iteration 21-40: Algorithm improvements emerge  
Iteration 41-60: Vectorization, library optimizations
Iteration 61-80: Fine-tuning, edge cases
Iteration 81-100: Convergence, minor tweaks

Failed Evolution Trajectory (Wrong Strategy)

Iteration 1-20: Broken changes, syntax errors
Iteration 21-40: Attempts to fix previous errors
Iteration 41-60: Giving up, adding comments
Iteration 61-80: Random changes, no progress
Iteration 81-100: Degraded below baseline

Key Experimental Findings

  1. Model-Strategy Alignment

    • Strong coding models performed better with diff-based evolution
    • Weaker models needed full rewrites to make progress
    • Flash Lite: 0.79x with diffs vs 1.10x with rewrites
  2. Serial Evaluation Impact

    • Parallel evaluation achieved 0.793x-1.10x performance
    • Serial evaluation degraded to 0.396x-0.585x (47-50% worse)
    • Time increased from 0.9 hours to 13+ hours
  3. Evolution Health Indicators

    • Diff-based evolution showed lower syntax error rates with capable models
    • Full rewrite led to more exploration but higher failure rates
    • Comment inflation occurred more frequently with full rewrites
  4. Strategy Switching Observations

    • Some experiments benefited from checkpoint-based strategy changes
    • Switching strategies mid-run allowed recovery from poor trajectories
    • Consistent strategy often outperformed switching

Future Investigation Areas

  1. Hybrid Evolution Approaches

    • Combining rewrite and diff strategies not yet tested
    • Alternating strategies based on performance plateaus
    • Task-specific strategy selection remains unexplored
  2. Parallelism Optimization

    • Dynamic worker allocation
    • Priority-based scheduling
    • Intelligent task grouping
  3. Strategy Learning

    • Predict best strategy from model
    • Learn from early iterations
    • Auto-switch on failure detection