Overview
We tested two fundamental evolution strategies: diff-based (incremental changes) and full rewrite (complete regeneration). Additionally, we discovered that parallelism is not just an optimization but a critical requirement.

Evolution Strategy Comparison
Diff-Based Evolution
Approach: Generate incremental changes to existing code, expressed in a simple diff format (applying such a hunk is sketched after the lists below):
- old_code
+ new_code
Advantages:
- Preserves working logic
- Gradual refinement
- Easier to track changes
- Lower chance of breaking
Disadvantages:
- Requires understanding of diffs
- Can get stuck in local optima
- Harder for weak models
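How a hunk gets applied depends on the framework; the following minimal sketch assumes search/replace semantics and illustrative names, and only shows the mechanics plus the main failure mode, a hunk that no longer matches the current program:
def apply_diff(source: str, old_block: str, new_block: str) -> str:
    # Replace one exact block of code; fail loudly if the block is missing,
    # which is how a stale or hallucinated diff gets rejected.
    if old_block not in source:
        raise ValueError("diff does not apply: old block not found")
    return source.replace(old_block, new_block, 1)

program = (
    "def solve(x):\n"
    "    total = 0\n"
    "    for v in x:\n"
    "        total += v * v\n"
    "    return total\n"
)
patched = apply_diff(
    program,
    "    total = 0\n    for v in x:\n        total += v * v\n    return total",
    "    return sum(v * v for v in x)",
)
print(patched)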
Full Rewrite Evolution
Approach: Generate complete new implementation
def solve(problem):
    # Entirely new implementation, regenerated from scratch each iteration
    ...
Advantages:
- Can make radical improvements
- Escape local optima
- Simpler for models
- Fresh perspective each time
Disadvantages:
- May lose working optimizations
- Higher chance of syntax errors
- More compute intensive
Model-Specific Results
Strong Coding Models
Gemini Flash 2.5
- Full rewrite: ~1.4x (estimated from Flash Lite scaling)
- Diff-based: 1.637x ✓
- Improvement: +17% with diffs
Example Successful Diff:
# Iteration 45 - PSD Projection optimization
- eigenvalues[eigenvalues < 0] = 0
- A_psd = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
+ # Vectorized operation eliminates intermediate array
+ A_psd = (eigenvectors * np.maximum(eigenvalues, 0)) @ eigenvectors.T
Clear understanding of optimization opportunity!
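The rewrite relies on the identity that scaling each eigenvector column by its clamped eigenvalue equals multiplying by the explicit diagonal matrix. A quick illustrative check of that equivalence (not part of the evolved program):
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
A = (A + A.T) / 2                                    # symmetric test matrix
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Original: clamp negative eigenvalues, rebuild via an explicit diagonal matrix.
clamped = eigenvalues.copy()
clamped[clamped < 0] = 0
A_psd_old = eigenvectors @ np.diag(clamped) @ eigenvectors.T

# Evolved: broadcast the clamped eigenvalues across the eigenvector columns,
# skipping the dense diagonal matrix and one matrix multiply.
A_psd_new = (eigenvectors * np.maximum(eigenvalues, 0)) @ eigenvectors.T

assert np.allclose(A_psd_old, A_psd_new)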
Qwen3-Coder (480B MoE)
- Full rewrite: 1.093x
- Diff-based: 1.414x ✓
- Improvement: +29% with diffs
Example Successful Diff:
# Iteration 67 - Graph algorithm optimization
- def dfs(node):
-     visited[node] = True
-     for neighbor in adj[node]:
-         if not visited[neighbor]:
-             dfs(neighbor)
+ # Switch to iterative approach
+ stack = [node]
+ while stack:
+     curr = stack.pop()
+     if not visited[curr]:
+         visited[curr] = True
+         stack.extend(adj[curr])
Algorithmic improvement with correct diff syntax!
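For context, here is the evolved traversal embedded in a self-contained function; the count_components wrapper is illustrative, not taken from the run:
def count_components(adj: dict[int, list[int]]) -> int:
    visited = {node: False for node in adj}
    components = 0
    for start in adj:
        if visited[start]:
            continue
        components += 1
        stack = [start]                  # iterative DFS: no recursion limit
        while stack:
            curr = stack.pop()
            if not visited[curr]:
                visited[curr] = True
                stack.extend(adj[curr])
    return components

print(count_components({0: [1], 1: [0], 2: []}))     # -> 2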
Weak/Small Models
Gemini Flash 2.5 Lite
- Full rewrite: 1.10x ✓
- Diff-based: 0.793x ✗
- Degradation: -28% with diffs!
Example Failed Diff Attempts:
# Iteration 23 - Attempted optimization
- result = np.dot(A, B)
+ result = np.dot(A, B) # TODO: optimize this
# Iteration 45 - Confused diff
- for i in range(n):
+ for i in range(n): # This loop could be faster
+ # Maybe use numpy?
Model adds comments instead of actual optimizations!
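A cheap guard against this failure mode is to reject candidate diffs whose only effect is adding comments. An illustrative heuristic (it would also strip "#" inside string literals, which is acceptable for a coarse filter):
def is_comment_only_change(old_lines: list[str], new_lines: list[str]) -> bool:
    def executable(lines: list[str]) -> list[str]:
        # Drop trailing comments and whitespace; compare what actually runs.
        return [line.split("#", 1)[0].rstrip() for line in lines]
    return executable(old_lines) == executable(new_lines)

old = ["result = np.dot(A, B)"]
new = ["result = np.dot(A, B)  # TODO: optimize this"]
print(is_comment_only_change(old, new))              # True -> reject the diff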
Serial vs Parallel Evolution
The Catastrophic Failure of Serial Evaluation
We tested running tasks sequentially vs in parallel:
Parallel Evaluation (Standard)
How it works:
- All 30 tasks evolve simultaneously (see the sketch after the results below)
- Cross-pollination between tasks
- Shared discoveries
- Load balanced across workers
Results:
- Flash Lite (diff): 0.793x in 0.9 hours
- Flash Lite (full): 1.10x in 0.9 hours
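A minimal sketch of this setup, assuming a per-task evaluate_task entry point (illustrative name) and Python's standard process pool:
from concurrent.futures import ProcessPoolExecutor, as_completed

def evaluate_task(task_id: int) -> float:
    ...  # placeholder: run one task's evolution/benchmark step, return its speedup

def evaluate_all(task_ids: list[int], workers: int = 8) -> dict[int, float]:
    # All tasks are submitted at once; results arrive as workers finish,
    # so one slow task never blocks the other 29.
    results: dict[int, float] = {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(evaluate_task, tid): tid for tid in task_ids}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results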
Serial Evaluation
How it works:
- Tasks evolve one at a time
- No cross-learning
- Sequential processing
- Single worker thread
Results:
- Flash Lite (diff): 0.396x in 13.0 hours (50% worse!)
- Flash Lite (full): 0.585x in 13.1 hours (47% worse!)
Why Serial Fails So Badly
No Cross-Task Learning
- Optimization discovered for task A can't help task B
- Each task starts from scratch
- Lost synergies
Compound Timeout Issues
Task 1: 26 minutes (some timeouts)
Task 2: 28 minutes (more timeouts)
...
Task 30: 35 minutes (cascading delays)
Total: 13+ hours!
Evolution Gets Stuck
- Hard tasks block progress
- No parallel exploration
- Single failure point
Resource Inefficiency
- CPU mostly idle
- No pipeline benefits
- Wasted evaluation cycles
Parallel Discovery Example
Count Connected Components optimization discovered in parallel:
- Task A: Discovers BFS is faster than DFS
- Task B: Simultaneously discovers deque optimization
- Task C: Finds early termination strategy
- Combined: All three insights merge → 95x speedup!
In serial: Each discovery happens in isolation, rarely combining.
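One way the three insights could combine in a single implementation (illustrative sketch, not the literal evolved program):
from collections import deque

def count_components(n: int, adj: list[list[int]]) -> int:
    visited = [False] * n
    remaining = n                        # bookkeeping for early termination
    components = 0
    for start in range(n):
        if visited[start]:
            continue
        components += 1
        visited[start] = True
        remaining -= 1
        queue = deque([start])           # deque gives O(1) pops from the left
        while queue:
            node = queue.popleft()       # BFS instead of recursive DFS
            for neighbor in adj[node]:
                if not visited[neighbor]:
                    visited[neighbor] = True
                    remaining -= 1
                    queue.append(neighbor)
        if remaining == 0:               # stop once every node is assigned
            break
    return components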
Strategy Selection Guidelines
Decision Tree
Is the model coding-specialized?
├─ Yes: Is it large/capable (>30B)?
│  ├─ Yes: Use diff-based evolution
│  └─ No: Test both, likely full rewrite
└─ No: Use full rewrite
Always use parallel evaluation!
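Expressed as code, the decision tree might look like the following sketch; the >30B threshold and flags mirror the tree above but are heuristics rather than a measured policy:
def choose_strategy(coding_specialized: bool, params_billions: float) -> str:
    if coding_specialized:
        if params_billions > 30:
            return "diff"            # large coding models handled diffs well
        return "test_both"           # small coding models: likely full rewrite
    return "full_rewrite"            # general-purpose models: full rewrite
# Parallel evaluation is assumed regardless of the strategy chosen.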
Model Capability Indicators
Signs model can handle diffs:
- Generates syntactically correct code consistently
- Shows understanding of code structure
- Makes meaningful incremental changes
- Preserves function signatures
Signs model needs full rewrites:
- Generates broken diffs
- Adds comments instead of code
- Loses track of changes
- Inconsistent formatting
Best Practices
For Diff-Based Evolution
- Clear Prompts: Show exact diff format
- Good Examples: Include successful diffs
- Incremental: Encourage small changes
- Validation: Check diff applies cleanly
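One way to implement the validation step, assuming a search/replace-style hunk (names illustrative): confirm the hunk applies exactly once and the patched program still parses before spending an evaluation on it:
import ast

def validate_diff(source: str, old_block: str, new_block: str) -> str | None:
    if source.count(old_block) != 1:
        return None                      # ambiguous or non-applying hunk
    patched = source.replace(old_block, new_block, 1)
    try:
        ast.parse(patched)               # cheap syntax check before running
    except SyntaxError:
        return None
    return patched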
For Full Rewrite Evolution
- Preserve Interface: Keep the function signature (a check is sketched after this list)
- Maintain Correctness: Emphasize working code
- Fresh Perspective: Don't show too much context
- Radical Changes: Encourage new approaches
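A simple way to enforce interface preservation automatically, assuming the task entry point is solve(problem) (illustrative sketch):
import ast

def preserves_interface(code: str, name: str = "solve",
                        params: tuple[str, ...] = ("problem",)) -> bool:
    # Accept a rewrite only if it still defines the expected function
    # with the expected positional parameters.
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            return tuple(arg.arg for arg in node.args.args) == params
    return False

print(preserves_interface("def solve(problem):\n    return problem"))   # True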
For Parallel Evaluation
- Sufficient Workers: At least 4-8 parallel
- Balanced Load: Distribute tasks evenly
- Shared Database: Enable cross-learning
- Migration: Allow program sharing
Evolution Patterns Observed
Successful Evolution Trajectory (Diff-Based)
Iteration 1-20: Syntax cleanup, minor optimizations
Iteration 21-40: Algorithm improvements emerge
Iteration 41-60: Vectorization, library optimizations
Iteration 61-80: Fine-tuning, edge cases
Iteration 81-100: Convergence, minor tweaks
Failed Evolution Trajectory (Wrong Strategy)
Iteration 1-20: Broken changes, syntax errors
Iteration 21-40: Attempts to fix previous errors
Iteration 41-60: Giving up, adding comments
Iteration 61-80: Random changes, no progress
Iteration 81-100: Degraded below baseline
Key Experimental Findings
Model-Strategy Alignment
- Strong coding models performed better with diff-based evolution
- Weaker models needed full rewrites to make progress
- Flash Lite: 0.79x with diffs vs 1.10x with rewrites
Serial Evaluation Impact
- Parallel evaluation achieved 0.793x-1.10x performance
- Serial evaluation degraded to 0.396x-0.585x (47-50% worse)
- Time increased from 0.9 hours to 13+ hours
Evolution Health Indicators
- Diff-based evolution showed lower syntax error rates with capable models
- Full rewrite led to more exploration but higher failure rates
- Comment inflation occurred more frequently with full rewrites
Strategy Switching Observations
- Some experiments benefited from checkpoint-based strategy changes
- Switching strategies mid-run allowed recovery from poor trajectories
- Consistent strategy often outperformed switching
Future Investigation Areas
Hybrid Evolution Approaches
- Combining rewrite and diff strategies has not yet been tested
- Alternating strategies based on performance plateaus
- Task-specific strategy selection remains unexplored
Parallelism Optimization
- Dynamic worker allocation
- Priority-based scheduling
- Intelligent task grouping
Strategy Learning
- Predict the best strategy from the model's characteristics
- Learn from early iterations
- Auto-switch on failure detection