Overview
We tested two fundamental evolution strategies: diff-based (incremental changes) and full rewrite (complete regeneration). Additionally, we discovered that parallelism is not just an optimization but a critical requirement.

Evolution Strategy Comparison
Diff-Based Evolution
Approach: Generate incremental changes to existing code, expressed in a simple diff format (applying such a hunk is sketched after the lists below):
- old_code
+ new_code
Advantages:
- Preserves working logic
- Gradual refinement
- Easier to track changes
- Lower chance of breaking
Disadvantages:
- Requires understanding of diffs
- Can get stuck in local optima
- Harder for weak models
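How a hunk gets applied depends on the framework; the following minimal sketch assumes search/replace semantics and illustrative names, and only shows the mechanics plus the main failure mode, a hunk that no longer matches the current program:
def apply_diff(source: str, old_block: str, new_block: str) -> str:
    # Replace one exact block of code; fail loudly if the block is missing,
    # which is how a stale or hallucinated diff gets rejected.
    if old_block not in source:
        raise ValueError("diff does not apply: old block not found")
    return source.replace(old_block, new_block, 1)

program = (
    "def solve(x):\n"
    "    total = 0\n"
    "    for v in x:\n"
    "        total += v * v\n"
    "    return total\n"
)
patched = apply_diff(
    program,
    "    total = 0\n    for v in x:\n        total += v * v\n    return total",
    "    return sum(v * v for v in x)",
)
print(patched)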
Full Rewrite Evolution
Approach: Generate complete new implementation
def solve(problem):
    # Entirely new implementation, regenerated from scratch each iteration
    ...
Advantages:
- Can make radical improvements
- Escape local optima
- Simpler for models
- Fresh perspective each time
Disadvantages:
- May lose working optimizations
- Higher chance of syntax errors
- More compute intensive
Model-Specific Results
Strong Coding Models
Gemini Flash 2.5
- Full rewrite: ~1.4x (estimated from Flash Lite scaling)
- Diff-based: 1.637x ✓
- Improvement: +17% with diffs
Example Successful Diff:
# Iteration 45 - PSD Projection optimization
- eigenvalues[eigenvalues < 0] = 0
- A_psd = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
+ # Vectorized operation eliminates intermediate array
+ A_psd = (eigenvectors * np.maximum(eigenvalues, 0)) @ eigenvectors.T
Clear understanding of optimization opportunity!
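The rewrite relies on the identity that scaling each eigenvector column by its clamped eigenvalue equals multiplying by the explicit diagonal matrix. A quick illustrative check of that equivalence (not part of the evolved program):
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
A = (A + A.T) / 2                                    # symmetric test matrix
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Original: clamp negative eigenvalues, rebuild via an explicit diagonal matrix.
clamped = eigenvalues.copy()
clamped[clamped < 0] = 0
A_psd_old = eigenvectors @ np.diag(clamped) @ eigenvectors.T

# Evolved: broadcast the clamped eigenvalues across the eigenvector columns,
# skipping the dense diagonal matrix and one matrix multiply.
A_psd_new = (eigenvectors * np.maximum(eigenvalues, 0)) @ eigenvectors.T

assert np.allclose(A_psd_old, A_psd_new)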
Qwen3-Coder (480B MoE)
- Full rewrite: 1.093x
- Diff-based: 1.414x ✓
- Improvement: +29% with diffs
Example Successful Diff:
# Iteration 67 - Graph algorithm optimization
- def dfs(node):
-     visited[node] = True
-     for neighbor in adj[node]:
-         if not visited[neighbor]:
-             dfs(neighbor)
+ # Switch to iterative approach
+ stack = [node]
+ while stack:
+     curr = stack.pop()
+     if not visited[curr]:
+         visited[curr] = True
+         stack.extend(adj[curr])
Algorithmic improvement with correct diff syntax!
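For context, here is the evolved traversal embedded in a self-contained function; the count_components wrapper is illustrative, not taken from the run:
def count_components(adj: dict[int, list[int]]) -> int:
    visited = {node: False for node in adj}
    components = 0
    for start in adj:
        if visited[start]:
            continue
        components += 1
        stack = [start]                  # iterative DFS: no recursion limit
        while stack:
            curr = stack.pop()
            if not visited[curr]:
                visited[curr] = True
                stack.extend(adj[curr])
    return components

print(count_components({0: [1], 1: [0], 2: []}))     # -> 2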
Weak/Small Models
Gemini Flash 2.5 Lite
- Full rewrite: 1.10x ✓
- Diff-based: 0.793x ✗
- Degradation: -28% with diffs!
Example Failed Diff Attempts:
# Iteration 23 - Attempted optimization
- result = np.dot(A, B)
+ result = np.dot(A, B) # TODO: optimize this
# Iteration 45 - Confused diff
- for i in range(n):
+ for i in range(n): # This loop could be faster
+ # Maybe use numpy?
Model adds comments instead of actual optimizations!
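A cheap guard against this failure mode is to reject candidate diffs whose only effect is adding comments. An illustrative heuristic (it would also strip "#" inside string literals, which is acceptable for a coarse filter):
def is_comment_only_change(old_lines: list[str], new_lines: list[str]) -> bool:
    def executable(lines: list[str]) -> list[str]:
        # Drop trailing comments and whitespace; compare what actually runs.
        return [line.split("#", 1)[0].rstrip() for line in lines]
    return executable(old_lines) == executable(new_lines)

old = ["result = np.dot(A, B)"]
new = ["result = np.dot(A, B)  # TODO: optimize this"]
print(is_comment_only_change(old, new))              # True -> reject the diff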
Serial vs Parallel Evolution
The Catastrophic Failure of Serial Evaluation
We tested running tasks sequentially vs in parallel:
Parallel Evaluation (Standard)
How it works:
- All 30 tasks evolve simultaneously (see the sketch after the results below)
- Cross-pollination between tasks
- Shared discoveries
- Load balanced across workers
Results:
- Flash Lite (diff): 0.793x in 0.9 hours
- Flash Lite (full): 1.10x in 0.9 hours
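A minimal sketch of this setup, assuming a per-task evaluate_task entry point (illustrative name) and Python's standard process pool:
from concurrent.futures import ProcessPoolExecutor, as_completed

def evaluate_task(task_id: int) -> float:
    ...  # placeholder: run one task's evolution/benchmark step, return its speedup

def evaluate_all(task_ids: list[int], workers: int = 8) -> dict[int, float]:
    # All tasks are submitted at once; results arrive as workers finish,
    # so one slow task never blocks the other 29.
    results: dict[int, float] = {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(evaluate_task, tid): tid for tid in task_ids}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results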
Serial Evaluation
How it works:
- Tasks evolve one at a time
- No cross-learning
- Sequential processing
- Single worker thread
Results:
- Flash Lite (diff): 0.396x in 13.0 hours (50% worse!)
- Flash Lite (full): 0.585x in 13.1 hours (47% worse!)
Why Serial Fails So Badly
No Cross-Task Learning
- Optimization discovered for task A can't help task B
- Each task starts from scratch
- Lost synergies
Compound Timeout Issues
Task 1: 26 minutes (some timeouts)
Task 2: 28 minutes (more timeouts)
...
Task 30: 35 minutes (cascading delays)
Total: 13+ hours!
Evolution Gets Stuck
- Hard tasks block progress
- No parallel exploration
- Single failure point
Resource Inefficiency
- CPU mostly idle
- No pipeline benefits
- Wasted evaluation cycles
Parallel Discovery Example
Count Connected Components optimization discovered in parallel:
- Task A: Discovers BFS is faster than DFS
- Task B: Simultaneously discovers deque optimization
- Task C: Finds early termination strategy
- Combined: All three insights merge → 95x speedup!
In serial: Each discovery happens in isolation, rarely combining.
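One way the three insights could combine in a single implementation (illustrative sketch, not the literal evolved program):
from collections import deque

def count_components(n: int, adj: list[list[int]]) -> int:
    visited = [False] * n
    remaining = n                        # bookkeeping for early termination
    components = 0
    for start in range(n):
        if visited[start]:
            continue
        components += 1
        visited[start] = True
        remaining -= 1
        queue = deque([start])           # deque gives O(1) pops from the left
        while queue:
            node = queue.popleft()       # BFS instead of recursive DFS
            for neighbor in adj[node]:
                if not visited[neighbor]:
                    visited[neighbor] = True
                    remaining -= 1
                    queue.append(neighbor)
        if remaining == 0:               # stop once every node is assigned
            break
    return components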
Strategy Selection Guidelines
Decision Tree
Is the model coding-specialized?
├─ Yes: Is it large/capable (>30B)?
│  ├─ Yes: Use diff-based evolution
│  └─ No: Test both, likely full rewrite
└─ No: Use full rewrite
Always use parallel evaluation!
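Expressed as code, the decision tree might look like the following sketch; the >30B threshold and flags mirror the tree above but are heuristics rather than a measured policy:
def choose_strategy(coding_specialized: bool, params_billions: float) -> str:
    if coding_specialized:
        if params_billions > 30:
            return "diff"            # large coding models handled diffs well
        return "test_both"           # small coding models: likely full rewrite
    return "full_rewrite"            # general-purpose models: full rewrite
# Parallel evaluation is assumed regardless of the strategy chosen.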
Model Capability Indicators
Signs model can handle diffs:
- Generates syntactically correct code consistently
- Shows understanding of code structure
- Makes meaningful incremental changes
- Preserves function signatures
Signs model needs full rewrites:
- Generates broken diffs
- Adds comments instead of code
- Loses track of changes
- Inconsistent formatting
Best Practices
For Diff-Based Evolution
- Clear Prompts: Show exact diff format
- Good Examples: Include successful diffs
- Incremental: Encourage small changes
- Validation: Check diff applies cleanly
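One way to implement the validation step, assuming a search/replace-style hunk (names illustrative): confirm the hunk applies exactly once and the patched program still parses before spending an evaluation on it:
import ast

def validate_diff(source: str, old_block: str, new_block: str) -> str | None:
    if source.count(old_block) != 1:
        return None                      # ambiguous or non-applying hunk
    patched = source.replace(old_block, new_block, 1)
    try:
        ast.parse(patched)               # cheap syntax check before running
    except SyntaxError:
        return None
    return patched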
For Full Rewrite Evolution
- Preserve Interface: Keep the function signature (a check is sketched after this list)
- Maintain Correctness: Emphasize working code
- Fresh Perspective: Don't show too much context
- Radical Changes: Encourage new approaches
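A simple way to enforce interface preservation automatically, assuming the task entry point is solve(problem) (illustrative sketch):
import ast

def preserves_interface(code: str, name: str = "solve",
                        params: tuple[str, ...] = ("problem",)) -> bool:
    # Accept a rewrite only if it still defines the expected function
    # with the expected positional parameters.
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            return tuple(arg.arg for arg in node.args.args) == params
    return False

print(preserves_interface("def solve(problem):\n    return problem"))   # True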
For Parallel Evaluation
- Sufficient Workers: At least 4-8 parallel
- Balanced Load: Distribute tasks evenly
- Shared Database: Enable cross-learning
- Migration: Allow program sharing
Evolution Patterns Observed
Successful Evolution Trajectory (Diff-Based)
Iteration 1-20: Syntax cleanup, minor optimizations
Iteration 21-40: Algorithm improvements emerge
Iteration 41-60: Vectorization, library optimizations
Iteration 61-80: Fine-tuning, edge cases
Iteration 81-100: Convergence, minor tweaks
Failed Evolution Trajectory (Wrong Strategy)
Iteration 1-20: Broken changes, syntax errors
Iteration 21-40: Attempts to fix previous errors
Iteration 41-60: Giving up, adding comments
Iteration 61-80: Random changes, no progress
Iteration 81-100: Degraded below baseline
Key Experimental Findings
Model-Strategy Alignment
- Strong coding models performed better with diff-based evolution
- Weaker models needed full rewrites to make progress
- Flash Lite: 0.79x with diffs vs 1.10x with rewrites
Serial Evaluation Impact
- Parallel evaluation achieved 0.793x-1.10x performance
- Serial evaluation degraded to 0.396x-0.585x (47-50% worse)
- Time increased from 0.9 hours to 13+ hours
Evolution Health Indicators
- Diff-based evolution showed lower syntax error rates with capable models
- Full rewrite led to more exploration but higher failure rates
- Comment inflation occurred more frequently with full rewrites
Strategy Switching Observations
- Some experiments benefited from checkpoint-based strategy changes
- Switching strategies mid-run allowed recovery from poor trajectories
- Consistent strategy often outperformed switching
Future Investigation Areas
Hybrid Evolution Approaches
- Combining rewrite and diff strategies has not yet been tested
- Alternating strategies based on performance plateaus
- Task-specific strategy selection remains unexplored
Parallelism Optimization
- Dynamic worker allocation
- Priority-based scheduling
- Intelligent task grouping
Strategy Learning
- Predict the best strategy from the model's characteristics
- Learn from early iterations
- Auto-switch on failure detection