Towards Open Evolutionary Agents

Executive Summary

A comprehensive analysis of 29 experiments using OpenEvolve, an open-source evolutionary coding agent, on the AlgoTune benchmark suite reveals that both proprietary and open models achieve significant performance gains. Notably, open models like Google's Gemma 3 27B and Alibaba's Qwen3-Coder 480B demonstrated impressive optimization capabilities, rivalling, and in some cases surpassing, their closed-source counterparts. Gemini Flash 2.5 achieved the top speedup of 2.04x, with Gemma 3 27B and Qwen3-Coder reaching 1.63x and 1.41x, respectively.

Executive Dashboard

Dashboard Overview: This 2x2 visualization summarizes key experimental findings:

  • A. Temperature Optimization Study: Shows pure temperature comparison (0.2, 0.4, 0.8) with clean performance curves
  • B. Extended Iterations: Demonstrates continued performance gains from 100 to 200 iterations
  • C. All Models Tested (Best Config): Comprehensive comparison of all 10 unique models at their optimal configurations
  • D. Parallelism Impact: Shows critical performance degradation with serial evaluation

Key Findings

  1. More Iterations Yield Better Results: 200 iterations achieved 2.04x speedup vs 1.64x for 100 iterations (24% improvement)
  2. Specialization Beats Size: Qwen3-Coder (480B MoE, coding-specialized) outperformed the general-purpose Qwen3-235B model
  3. Evolution Strategy is Model-Dependent: Strong coding models excel with diff-based evolution; weaker models need full rewrites
  4. Temperature Optimization Matters: 0.4 proved optimal for Gemini models (1.29x vs 1.17x at 0.2)
  5. Artifacts Boost Performance: Including debugging artifacts improved results by 17%
  6. Parallelism is Critical: Serial evaluation catastrophically fails (50% worse performance, 14x slower)

Top Performers

| Model | Config | AlgoTune Score | Best Task |
|-------|--------|----------------|-----------|
| Gemini Flash 2.5 | 200 iter, diff, temp 0.4 | 2.04x | count_connected_components: 95.78x |
| Gemini Flash 2.5 | 100 iter, diff, temp 0.4 | 1.64x | psd_cone_projection: 32.7x |
| Gemma 3 27B | 100 iter, diff | 1.63x | psd_cone_projection: 41.1x |
| Qwen3-Coder 480B | 100 iter, diff, temp 0.6 | 1.41x | psd_cone_projection: 41.9x |

Experiment Statistics

  • Total Experiments: 29
  • Models Tested: 10 unique (Gemini family, Qwen family, Meta, Google Open, Ensembles)
  • Tasks Evaluated: 30 AlgoTune benchmarks
  • Best Overall Result: 2.04x speedup (Gemini Flash 2.5, 200 iterations)
  • Worst Result: 0.396x (Serial diff evaluation - 50% performance degradation)
  • Temperature Range Tested: 0.2, 0.4, 0.8
  • Evolution Strategies: Diff-based and full rewrite
  • Evaluation Approach: Parallel (16 workers) vs Serial comparison

Table of Contents

  1. Experimental Overview
  2. Methodology
  3. Detailed Results
  4. Evolved Code Analysis
  5. Parameter Optimization Study
  6. Model Comparison
  7. Technical Implementation
  8. Conclusions

Detailed Experiment Analyses

Experimental Overview

Experiment Phases

  • Phase 1: Initial experiments and parameter tuning
  • Phase 2: Extended iteration experiments and new models

Models Tested

  1. Gemini Family

    • Flash 2.5: Google's efficient coding model
    • Flash 2.5 Lite: Smaller variant for baseline testing
  2. Qwen Family

    • Qwen3-Coder-480B-A35B-Instruct: 480B MoE with 35B active, coding-specialized
    • Qwen3-235B-A22B: 235B general-purpose model
    • Qwen3-32B: Smaller variant
  3. Other Models

    • Gemma 3 27B: Google's open model
    • Llama 3.3 70B: Meta's latest model
    • Ensemble configurations

Experiment Categories

  1. Baseline Testing (7 experiments): Establish model baselines
  2. Temperature Study (3 experiments): Find optimal temperature
  3. Parameter Tuning (8 experiments): Optimize various parameters
  4. Model Comparison (6 experiments): Cross-model analysis
  5. Extended Iterations (4 experiments): Impact of more evolution

Methodology

AlgoTune Integration

  • Converted 30 AlgoTune tasks to OpenEvolve format using algotune_to_openevolve.py
  • Tasks span various algorithm categories: graphs, linear algebra, optimization
  • Used AlgoTune's standard evaluation: harmonic mean of speedups

Evolution Approach

  • Island-based MAP-Elites: 4 islands with periodic migration
  • Cascade evaluation: Quick validation → performance testing → full evaluation
  • Two evolution strategies:
    • Diff-based: Incremental code changes
    • Full rewrite: Complete function regeneration
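
As a rough sketch of the distinction between the two strategies (the actual OpenEvolve diff format and prompting are not shown here and may differ), a diff-based step can be thought of as a targeted search-and-replace edit applied to the current program, while a full rewrite discards it entirely:

# Illustrative sketch only: a simplified search-and-replace "diff" step versus a full rewrite.
# OpenEvolve's real diff format and prompting are assumptions here and may differ.

def apply_diff(program: str, search: str, replace: str) -> str:
    """Apply an incremental edit: swap one code fragment for another."""
    if search not in program:
        raise ValueError("diff did not apply: search block not found")
    return program.replace(search, replace, 1)

def apply_full_rewrite(_program: str, new_program: str) -> str:
    """Discard the current program and use the model's complete regeneration."""
    return new_program

current = "def solve(x):\n    return sorted(x)\n"
evolved = apply_diff(current, "sorted(x)", "sorted(set(x))")                      # small, targeted change
rewritten = apply_full_rewrite(current, "def solve(x):\n    return sorted(set(x))\n")  # whole new program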

Performance Metric

AlgoTune uses a logarithmic performance scale that gets converted to speedup factors:

# AlgoTune score calculation
from statistics import harmonic_mean

speedup = (10 ** (performance * 3)) - 1       # Convert a single task's log-scale score to a speedup factor
algotune_score = harmonic_mean(all_speedups)  # Overall benchmark score across all tasks

The harmonic mean emphasizes consistent performance across all tasks - a model must perform well on all 30 benchmarks to achieve a high score.
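
To see why consistency matters, here is a small worked example with hypothetical per-task speedups, using the standard library's harmonic mean:

# Hypothetical per-task speedups, illustrating why the harmonic mean rewards consistency.
from statistics import harmonic_mean

consistent = [2.0, 2.0, 2.0, 2.0]    # solid gains everywhere
spiky      = [95.0, 1.0, 1.0, 0.4]   # one huge win, one regression

print(harmonic_mean(consistent))          # 2.0
print(round(harmonic_mean(spiky), 2))     # ~0.89 -- the single regression dominates the score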

Detailed Results

Performance Rankings - All Experiments

Performance Rankings

Chart Explanation: Horizontal bars show AlgoTune scores (harmonic mean of speedups) for all experiments. Green bars indicate good performance (>1.5x), blue for moderate (1.2-1.5x), orange for modest (1.0-1.2x), and red for poor (<1.0x). The dashed line at 1.0x represents baseline performance.

Phase 1: Baseline Experiments

1. Gemini Flash 2.5 Lite Baseline

  • Config: Full rewrite, temp 0.8, 4k tokens
  • Result: 1.10x speedup
  • Key Learning: Established baseline for optimization

2. Evolution Strategy Discovery

Comparing diff-based vs full rewrite on Flash Lite:

  • Full rewrite: 1.10x ✓
  • Diff-based: 0.79x ✗

Evidence of Struggle with Diffs:

# Attempted diff that failed
- for i in range(n):
-     if not visited[i]:
-         dfs(i)
+ for i in range(n):
+     if not visited[i]:
+         # Incomplete optimization
+         dfs(i)  # No actual improvement

Phase 2: Temperature Optimization Study

Systematic testing on Gemini Flash 2.5 Lite with 16k tokens:

| Temperature | AlgoTune Score | Avg Performance | Duration | Key Finding |
|-------------|----------------|-----------------|----------|-------------|
| 0.2 | 1.17x | 0.162 | 4074s | Conservative but stable |
| 0.4 | 1.29x | 0.175 | 3784s | Best performance achieved |
| 0.8 | 1.02x | 0.159 | 3309s | Too random, performance drops |

Relative Differences: 10.3% improvement from 0.2→0.4, 20.9% degradation from 0.4→0.8

Phase 3: Parameter Fine-Tuning

All experiments at optimal temp 0.4:

Token Limit Impact

  • 16k tokens: 1.29x (baseline)
  • 32k tokens: 1.28x (no improvement)
  • Conclusion: 16k sufficient, larger context doesn't help

Artifacts Impact

  • With artifacts: 1.29x
  • Without artifacts: 1.07x
  • Impact: 17% performance loss without debugging info

Note: "Artifacts" in OpenEvolve are debugging outputs that programs can return during evaluation, helping the LLM understand execution behavior and make better optimization decisions.

Top Programs for Inspiration

  • Top 3: 1.29x (baseline)
  • Top 5: 1.28x (minimal difference)
  • Conclusion: 3 programs sufficient

Migration Rate

  • 0.1 rate: 1.29x (baseline)
  • 0.2 rate: 1.25x (slightly worse)
  • Conclusion: Less frequent migration preserves diversity

Phase 4: Applying Learnings to Strong Models

Gemini Flash 2.5 (100 iterations)

  • Config: Diff-based, temp 0.4, 16k tokens
  • Result: 1.64x speedup
  • Top achievement: count_connected_components at 48.1x

Qwen3-Coder-480B

  • Config: Diff-based, temp 0.6, 4k tokens
  • Result: 1.41x speedup
  • Note: Only tested at temp 0.6, may not be optimal

Gemma 3 27B

  • Config: Diff-based (inherited from successful configs)
  • Result: 1.63x speedup
  • Surprise: Nearly matched Flash 2.5's 100-iteration result (1.63x vs 1.64x)

Phase 5: Extended Iterations

Gemini Flash 2.5 (200 iterations)

  • Result: 2.04x speedup (24% improvement over 100 iter)
  • Breakthrough: count_connected_components reached 95.78x
  • Found the optimal solution in just iteration 2

Progress over iterations:

  • Iteration 10: 1.15x
  • Iteration 50: 1.48x
  • Iteration 100: 1.64x
  • Iteration 200: 2.04x

Phase 6: Serial vs Parallel Evaluation

Critical Finding: Parallelism is Essential

We tested serial evaluation (one task at a time) vs parallel evaluation:

| Configuration | Evolution Type | AlgoTune Score | Duration | vs Parallel |
|---------------|----------------|----------------|----------|-------------|
| Parallel | Diff-based | 0.793x | 0.9 hours | Baseline |
| Serial | Diff-based | 0.396x | 13.0 hours | 50% worse, 14x slower |
| Parallel | Full rewrite | 1.10x | 0.9 hours | Baseline |
| Serial | Full rewrite | 0.585x | 13.1 hours | 47% worse, 14x slower |

Key Insights:

  1. Performance Catastrophe: Serial evaluation leads to 47-50% worse results
  2. Time Explosion: 14x slower execution (13 hours vs 0.9 hours)
  3. Why It Fails:
    • No cross-task learning during evolution
    • Timeout issues compound across serial tasks
    • Evolution gets "stuck" on difficult tasks
    • Lost opportunity for parallel discovery

Conclusion: Parallel evaluation is not just an optimization - it's essential for OpenEvolve to function properly
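
For illustration, here is a minimal sketch of the two evaluation modes being compared, assuming a 16-worker process pool (OpenEvolve's actual scheduler and evolution loop are more involved):

# Minimal sketch of serial vs parallel task evaluation, assuming a 16-worker process pool.
# OpenEvolve's real scheduling and cross-task evolution loop are more involved.
from concurrent.futures import ProcessPoolExecutor

def evaluate_task(task_name: str) -> float:
    # Placeholder: run the AlgoTune benchmark for one task and return its speedup.
    return 1.0

def evaluate_serial(tasks):
    return [evaluate_task(t) for t in tasks]            # one task at a time: ~13 hours in these experiments

def evaluate_parallel(tasks, workers: int = 16):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate_task, tasks))     # tasks run concurrently: ~0.9 hours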

Evolved Code Analysis

Example 1: Count Connected Components - Algorithm Discovery

Original Implementation (All models start here):

def solve(problem):
    n = problem["num_nodes"]
    edges = problem["edges"]
    
    # Build adjacency list
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    
    # DFS to count components
    visited = [False] * n
    count = 0
    
    def dfs(node):
        visited[node] = True
        for neighbor in adj[node]:
            if not visited[neighbor]:
                dfs(neighbor)
    
    for i in range(n):
        if not visited[i]:
            dfs(i)
            count += 1
    
    return {"number_connected_components": count}

Performance: 1.0x (baseline)

Evolved - Gemini Flash 2.5 (200 iter):

def solve(problem):
    n = problem.get("num_nodes", 0)
    edges = problem.get("edges", [])
    
    if n == 0:
        return {"number_connected_components": 0}
    
    # Build adjacency list with deque for O(1) operations
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    
    # BFS with early termination and efficient tracking
    visited = [False] * n
    count = 0
    from collections import deque
    
    for start in range(n):
        if visited[start]:
            continue
            
        # BFS for this component
        queue = deque([start])
        visited[start] = True
        
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if not visited[neighbor]:
                    visited[neighbor] = True
                    queue.append(neighbor)
        
        count += 1
    
    return {"number_connected_components": count}

Performance: 95.78x speedup

Key Optimizations:

  • Switched from DFS to BFS
  • Used deque for O(1) queue operations
  • Added early termination checks
  • Defensive programming with get()

Evolved - Qwen3-Coder (100 iter):

def solve(problem):
    n = problem["num_nodes"]
    edges = problem["edges"]
    
    # Union-Find approach - completely different algorithm!
    parent = list(range(n))
    
    def find(x):
        if parent[x] != x:
            parent[x] = find(parent[x])  # Path compression
        return parent[x]
    
    def union(x, y):
        px, py = find(x), find(y)
        if px != py:
            parent[px] = py
    
    # Process edges
    for u, v in edges:
        union(u, v)
    
    # Count unique roots
    return {"number_connected_components": len(set(find(i) for i in range(n)))}

Performance: ~25x speedup

Key Insight: A different model found a completely different algorithm!

  • Union-Find with path compression
  • O(α(n)) per operation (near constant)
  • More memory efficient

Example 2: PSD Cone Projection - Incremental Optimization

Original Implementation:

def solve(problem):
    import numpy as np
    
    A = np.array(problem["matrix"])
    # Eigenvalue decomposition
    eigenvalues, eigenvectors = np.linalg.eigh(A)
    # Set negative eigenvalues to zero
    eigenvalues[eigenvalues < 0] = 0
    # Reconstruct matrix
    A_psd = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
    return {"projected_matrix": A_psd.tolist()}

Performance: 1.0x (baseline)

Evolution with Gemini Flash 2.5 Diffs:

Iteration 23:

- eigenvalues[eigenvalues < 0] = 0
- A_psd = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
+ # Vectorized operation
+ eigenvalues = np.maximum(eigenvalues, 0)
+ A_psd = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T

Performance: 1.8x

Iteration 67:

- A_psd = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
+ # Eliminate intermediate array
+ A_psd = (eigenvectors * eigenvalues) @ eigenvectors.T

Performance: 15.2x

Final (Iteration 95):

def solve(problem):
    import numpy as np
    
    A = np.array(problem["matrix"])
    w, v = np.linalg.eigh(A)
    # One-line vectorized operation
    A_psd = (v * np.maximum(w, 0)) @ v.T
    return {"projected_matrix": A_psd.tolist()}

Performance: 32.7x speedup

Example 3: DCT Optimization - Convergent Evolution

Original:

def solve(problem):
    import numpy as np
    from scipy.fftpack import dct
    
    signal = np.array(problem["signal"])
    return {"dct": dct(signal, type=1).tolist()}

Multiple Models Converged to Same Solution:

Gemini Flash 2.5:

signal = np.array(problem["signal"], dtype=np.float64)
result = dct(signal, type=1, norm=None, overwrite_x=False)

Qwen3-Coder:

signal = np.asarray(problem["signal"], dtype=np.float64)
return {"dct": dct(signal, type=1).tolist()}

Key Finding: Both discovered dtype=np.float64 optimization independently!

  • 6.48x speedup for both
  • Shows optimal solution exists and is discoverable

Example 4: SHA256 - Hardware Limitations

Original:

def solve(problem):
    import hashlib
    
    data = problem["data"].encode('utf-8')
    hash_object = hashlib.sha256(data)
    return {"hash": hash_object.hexdigest()}

Best Evolution (multiple models):

def solve(problem):
    import hashlib
    
    return {"hash": hashlib.sha256(problem["data"].encode()).hexdigest()}

Performance: Only 1.1x speedup

Learning: Hardware-bound operations have limited optimization potential

Model Comparison

For a comprehensive view of all 10 unique models tested, see Panel C in the Executive Dashboard. The complete 29-experiment results are shown in the Performance Rankings chart above.

Task-Specific Performance Heatmap

Task Performance Heatmap

This heatmap shows speedup factors achieved by different models on specific AlgoTune tasks. Darker red indicates higher speedup. Notable patterns:

  • Count connected components shows extreme variation (95x for Flash 2.5 vs 15x for Gemma 3)
  • PSD projection achieved consistent high performance across models (32-42x)
  • Hardware-bound tasks like SHA256 show minimal improvement regardless of model

Specialization vs Size Analysis

Qwen3-Coder (480B MoE, 35B active) vs Qwen3-235B:

  • Despite being "larger" (480B total parameters), Qwen3-Coder has only 35B active
  • Coding specialization led to 68% better performance (1.41x vs 0.84x)
  • Evidence that training data and objectives matter more than size

Evolution Strategy by Model Capability

| Model Type | Best Strategy | Evidence |
|------------|---------------|----------|
| Small/Weak (Flash Lite) | Full rewrite | 0.79x with diff vs 1.10x full |
| Strong Coding (Flash 2.5, Qwen3-Coder) | Diff-based | 1.64x vs lower with full |
| General Purpose | Varies | Model-dependent |

Ensemble Experiments: Why More Models ≠ Better Results

Despite combining our two best performers (Gemini Flash 2.5 at 1.64x and Qwen3-Coder at 1.41x), the ensemble achieved only 1.23x speedup - worse than either model individually.

Ensemble Analysis

Visualization Guide:

  • Panel A: Shows individual model performance vs ensemble - ensemble underperforms both models
  • Panel B: All 30 Tasks Performance Comparison - complete performance profiles for Gemini, Qwen, and Ensemble
  • Panel C: Evolution progress comparison - ensemble shows irregular oscillation vs smooth single model progress
  • Panel D: Model Agreement by Task - 3x30 heatmap showing pairwise agreement for each task, revealing conflict zones

Key Finding: Conflicting Optimization Strategies

The ensemble failed because models pursued incompatible approaches:

  • Count Connected Components: Gemini used BFS, Qwen used Union-Find → oscillation
  • DCT Optimization: Different dtype strategies → lost optimizations
  • Result: 19% underperformance vs expected 1.5-1.7x

Example: Algorithm Conflict

# Iteration N: Gemini suggests BFS
queue = deque([start])  # BFS approach

# Iteration N+1: Qwen suggests Union-Find  
parent = list(range(n))  # Different algorithm!

# Result: Hybrid mess that optimizes neither

Observed Ensemble Failure Modes

  1. Algorithm conflicts: BFS vs Union-Find approaches in same task
  2. Optimization conflicts: Memory optimization vs compute optimization
  3. Style conflicts: Functional vs imperative code patterns

→ Detailed ensemble analysis

Technical Implementation Details

OpenEvolve Configuration

# Optimal configuration discovered
max_iterations: 200  # More is better
diff_based_evolution: true  # For capable models
temperature: 0.4  # For Gemini, 0.6 for others
max_tokens: 16000  # Sweet spot
num_top_programs: 3
num_islands: 4
migration_interval: 20
migration_rate: 0.1
include_artifacts: true  # Critical for performance

Island Evolution Dynamics

  • 4 islands maintain diversity
  • Migration every 20 iterations prevents premature convergence
  • Each island can discover different solutions
  • Best program tracked globally
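
A simplified sketch of these dynamics, with selection and MAP-Elites binning stubbed out (the real OpenEvolve implementation differs in detail):

# Simplified sketch of island-based evolution with periodic migration.
# Selection, MAP-Elites binning, and mutation are stubbed out; OpenEvolve's internals differ.
import random

MIGRATION_INTERVAL = 20
MIGRATION_RATE = 0.1

def evolve(islands, iterations, mutate, score):
    best = max((p for island in islands for p in island), key=score)   # best program tracked globally
    for it in range(1, iterations + 1):
        for island in islands:
            parent = max(random.sample(island, k=min(3, len(island))), key=score)  # simple tournament pick
            child = mutate(parent)
            island.append(child)
            if score(child) > score(best):
                best = child
        if it % MIGRATION_INTERVAL == 0:                               # periodic migration between islands
            for i, island in enumerate(islands):
                migrants = random.sample(island, k=max(1, int(len(island) * MIGRATION_RATE)))
                islands[(i + 1) % len(islands)].extend(migrants)       # copy migrants to the next island
    return best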

Cascade Evaluation Benefits

  1. Stage 1: Quick syntax/import validation (< 1s)
  2. Stage 2: Small test cases (< 10s)
  3. Stage 3: Full benchmark suite (< 60s)

Saved ~70% of evaluation time by failing fast.
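
A sketch of this fail-fast structure, with the stage contents treated as assumptions rather than OpenEvolve's exact checks (only the stage budgets come from the list above):

# Sketch of cascade evaluation: cheap checks first, expensive benchmarking only for survivors.
# Stage contents are illustrative; OpenEvolve's cascade checks and thresholds may differ.

def cascade_evaluate(candidate_source: str, quick_tests, full_benchmark):
    # Stage 1: syntax / import validation (< 1s)
    try:
        compile(candidate_source, "<candidate>", "exec")
    except SyntaxError:
        return {"score": 0.0, "failed_stage": 1}

    # Stage 2: small test cases (< 10s)
    if not quick_tests(candidate_source):
        return {"score": 0.0, "failed_stage": 2}

    # Stage 3: full benchmark suite (< 60s)
    return {"score": full_benchmark(candidate_source), "failed_stage": None}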

Key Observations from Experiments

1. Performance Improvements by Type

  • Algorithmic changes (DFS → BFS): Up to 95x speedup observed
  • Vectorization optimizations: 32x speedup achieved in matrix operations
  • Minor code changes (variable renaming, loop adjustments): Typically < 2x speedup

2. Model Solution Diversity

  • Count connected components: Gemini found BFS approach, Qwen found Union-Find
  • Matrix operations: Different models used different vectorization strategies
  • Multiple valid optimization paths were discovered for most tasks

3. Evolution Strategy Performance

  • Diff-based evolution with strong coding models: Up to 1.64x overall speedup
  • Full rewrite with weaker models: 1.10x speedup (vs 0.79x with diffs)
  • Model capability determined optimal strategy

4. Iteration Impact

  • 100 iterations: 1.64x overall speedup (harmonic mean)
  • 200 iterations: 2.04x overall speedup (24% improvement)
  • Performance continued improving through 200 iterations

5. Parameter Effects

  • Temperature 0.4: Best individual run (1.291x) but high variance
  • Including artifacts: ~17% performance improvement observed
  • Token limit: 16k vs 32k showed no significant difference
  • Migration rate: 0.1 outperformed 0.2 in experiments

6. Parallelism Impact

  • Serial evaluation: 0.396x-0.585x speedup (performance degradation)
  • Parallel evaluation: 0.793x-1.10x speedup for same configurations
  • Time difference: 13 hours (serial) vs 0.9 hours (parallel)

Conclusions

Key Experimental Findings

  1. Iteration Count: 200 iterations achieved 2.04x speedup vs 1.64x for 100 iterations (24% improvement)
  2. Model Specialization: Qwen3-Coder (coding-specialized) outperformed larger general models
  3. Temperature Settings: Results varied - best individual run at 0.4, but high variance observed
  4. Evolution Strategy: Diff-based worked better for strong coding models, full rewrite for weaker models
  5. Parallelism: Serial evaluation degraded performance by 47-50% and increased time 14x
  6. Ensemble Results: Combined models achieved 1.23x vs 1.64x and 1.41x individually

Observed Patterns

  • Models discovered different algorithmic approaches (BFS vs Union-Find for graph problems)
  • Hardware-bound tasks (SHA256) showed minimal improvement across all configurations
  • Artifacts inclusion improved performance by approximately 17%
  • Migration rate of 0.1 performed better than 0.2 in tested configurations