Towards Open Evolutionary Agents

Executive Summary

A comprehensive analysis of 29 experiments using OpenEvolve, an open-source evolutionary coding agent, on the AlgoTune benchmark suite reveals that both proprietary and open models achieve significant performance gains. Notably, open models like Google's Gemma 3 27B and Alibaba's Qwen3-Coder 480B demonstrated impressive optimization capabilities, rivalling, and in some cases surpassing, their closed-source counterparts. Gemini Flash 2.5 achieved the top speedup of 2.04x, with Gemma 3 27B and Qwen3-Coder reaching 1.63x and 1.41x, respectively.

Executive Dashboard

Dashboard Overview: This 2x2 visualization summarizes key experimental findings:

  • A. Temperature Optimization Study: Shows pure temperature comparison (0.2, 0.4, 0.8) with clean performance curves
  • B. Extended Iterations: Demonstrates continued performance gains from 100 to 200 iterations
  • C. All Models Tested (Best Config): Comprehensive comparison of all 10 unique models at their optimal configurations
  • D. Parallelism Impact: Shows critical performance degradation with serial evaluation

Key Findings

  1. More Iterations Yield Better Results: 200 iterations achieved 2.04x speedup vs 1.64x for 100 iterations (24% improvement)
  2. Specialization Beats Size: Qwen3-Coder (480B MoE, coding-specialized) outperformed the general-purpose Qwen3-235B model
  3. Evolution Strategy is Model-Dependent: Strong coding models excel with diff-based evolution; weaker models need full rewrites
  4. Temperature Optimization Matters: 0.4 proved optimal for Gemini models (1.29x vs 1.17x at 0.2)
  5. Artifacts Boost Performance: Including debugging artifacts improved results by 17%
  6. Parallelism is Critical: Serial evaluation catastrophically fails (50% worse performance, 14x slower)

Top Performers

| Model | Config | AlgoTune Score | Best Task |
|-------|--------|----------------|-----------|
| Gemini Flash 2.5 | 200 iter, diff, temp 0.4 | 2.04x | count_connected_components: 95.78x |
| Gemini Flash 2.5 | 100 iter, diff, temp 0.4 | 1.64x | psd_cone_projection: 32.7x |
| Gemma 3 27B | 100 iter, diff | 1.63x | psd_cone_projection: 41.1x |
| Qwen3-Coder 480B | 100 iter, diff, temp 0.6 | 1.41x | psd_cone_projection: 41.9x |

Experiment Statistics

  • Total Experiments: 29
  • Models Tested: 10 unique (Gemini family, Qwen family, Meta, Google Open, Ensembles)
  • Tasks Evaluated: 30 AlgoTune benchmarks
  • Best Overall Result: 2.04x speedup (Gemini Flash 2.5, 200 iterations)
  • Worst Result: 0.396x (Serial diff evaluation - 50% performance degradation)
  • Temperature Range Tested: 0.2, 0.4, 0.8
  • Evolution Strategies: Diff-based and full rewrite
  • Evaluation Approach: Parallel (16 workers) vs Serial comparison

Table of Contents

  1. Experimental Overview
  2. Methodology
  3. Detailed Results
  4. Evolved Code Analysis
  5. Parameter Optimization Study
  6. Model Comparison
  7. Technical Implementation
  8. Conclusions

Detailed Experiment Analyses

Experimental Overview

Experiment Phases

  • Phase 1: Initial experiments and parameter tuning
  • Phase 2: Extended iteration experiments and new models

Models Tested

  1. Gemini Family

    • Flash 2.5: Google's efficient coding model
    • Flash 2.5 Lite: Smaller variant for baseline testing
  2. Qwen Family

    • Qwen3-Coder-480B-A35B-Instruct: 480B MoE with 35B active, coding-specialized
    • Qwen3-235B-A22B: 235B general-purpose model
    • Qwen3-32B: Smaller variant
  3. Other Models

    • Gemma 3 27B: Google's open model
    • Llama 3.3 70B: Meta's latest model
    • Ensemble configurations

Experiment Categories

  1. Baseline Testing (7 experiments): Establish model baselines
  2. Temperature Study (3 experiments): Find optimal temperature
  3. Parameter Tuning (8 experiments): Optimize various parameters
  4. Model Comparison (6 experiments): Cross-model analysis
  5. Extended Iterations (4 experiments): Impact of more evolution

Methodology

AlgoTune Integration

  • Converted 30 AlgoTune tasks to OpenEvolve format using algotune_to_openevolve.py
  • Tasks span various algorithm categories: graphs, linear algebra, optimization
  • Used AlgoTune's standard evaluation: harmonic mean of speedups

Evolution Approach

  • Island-based MAP-Elites: 4 islands with periodic migration
  • Cascade evaluation: Quick validation → performance testing → full evaluation
  • Two evolution strategies:
    • Diff-based: Incremental code changes
    • Full rewrite: Complete function regeneration
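
As a rough sketch of the distinction between the two strategies (the actual OpenEvolve diff format and prompting are not shown here and may differ), a diff-based step can be thought of as a targeted search-and-replace edit applied to the current program, while a full rewrite discards it entirely:

# Illustrative sketch only: a simplified search-and-replace "diff" step versus a full rewrite.
# OpenEvolve's real diff format and prompting are assumptions here and may differ.

def apply_diff(program: str, search: str, replace: str) -> str:
    """Apply an incremental edit: swap one code fragment for another."""
    if search not in program:
        raise ValueError("diff did not apply: search block not found")
    return program.replace(search, replace, 1)

def apply_full_rewrite(_program: str, new_program: str) -> str:
    """Discard the current program and use the model's complete regeneration."""
    return new_program

current = "def solve(x):\n    return sorted(x)\n"
evolved = apply_diff(current, "sorted(x)", "sorted(set(x))")                      # small, targeted change
rewritten = apply_full_rewrite(current, "def solve(x):\n    return sorted(set(x))\n")  # whole new program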

Performance Metric

AlgoTune uses a logarithmic performance scale that gets converted to speedup factors:

# AlgoTune score calculation
from statistics import harmonic_mean

speedup = (10 ** (performance * 3)) - 1       # Convert a single task's log-scale score to a speedup factor
algotune_score = harmonic_mean(all_speedups)  # Overall benchmark score across all tasks

The harmonic mean emphasizes consistent performance across all tasks - a model must perform well on all 30 benchmarks to achieve a high score.
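
To see why consistency matters, here is a small worked example with hypothetical per-task speedups, using the standard library's harmonic mean:

# Hypothetical per-task speedups, illustrating why the harmonic mean rewards consistency.
from statistics import harmonic_mean

consistent = [2.0, 2.0, 2.0, 2.0]    # solid gains everywhere
spiky      = [95.0, 1.0, 1.0, 0.4]   # one huge win, one regression

print(harmonic_mean(consistent))          # 2.0
print(round(harmonic_mean(spiky), 2))     # ~0.89 -- the single regression dominates the score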

Detailed Results

Performance Rankings - All Experiments

Performance Rankings

Chart Explanation: Horizontal bars show AlgoTune scores (harmonic mean of speedups) for all experiments. Green bars indicate good performance (>1.5x), blue for moderate (1.2-1.5x), orange for modest (1.0-1.2x), and red for poor (<1.0x). The dashed line at 1.0x represents baseline performance.

Phase 1: Baseline Experiments

1. Gemini Flash 2.5 Lite Baseline

  • Config: Full rewrite, temp 0.8, 4k tokens
  • Result: 1.10x speedup
  • Key Learning: Established baseline for optimization

2. Evolution Strategy Discovery

Comparing diff-based vs full rewrite on Flash Lite:

  • Full rewrite: 1.10x ✓
  • Diff-based: 0.79x ✗

Evidence of Struggle with Diffs:

# Attempted diff that failed
- for i in range(n):
-     if not visited[i]:
-         dfs(i)
+ for i in range(n):
+     if not visited[i]:
+         # Incomplete optimization
+         dfs(i)  # No actual improvement

Phase 2: Temperature Optimization Study

Systematic testing on Gemini Flash 2.5 Lite with 16k tokens:

| Temperature | AlgoTune Score | Avg Performance | Duration | Key Finding |
|-------------|----------------|-----------------|----------|-------------|
| 0.2 | 1.17x | 0.162 | 4074s | Conservative but stable |
| 0.4 | 1.29x | 0.175 | 3784s | Best performance achieved |
| 0.8 | 1.02x | 0.159 | 3309s | Too random, performance drops |

Relative Differences: 10.3% improvement from 0.2→0.4, 20.9% degradation from 0.4→0.8

Phase 3: Parameter Fine-Tuning

All experiments at optimal temp 0.4:

Token Limit Impact

  • 16k tokens: 1.29x (baseline)
  • 32k tokens: 1.28x (no improvement)
  • Conclusion: 16k sufficient, larger context doesn't help

Artifacts Impact

  • With artifacts: 1.29x
  • Without artifacts: 1.07x
  • Impact: 17% performance loss without debugging info

Note: "Artifacts" in OpenEvolve are debugging outputs that programs can return during evaluation, helping the LLM understand execution behavior and make better optimization decisions.

Top Programs for Inspiration

  • Top 3: 1.29x (baseline)
  • Top 5: 1.28x (minimal difference)
  • Conclusion: 3 programs sufficient

Migration Rate

  • 0.1 rate: 1.29x (baseline)
  • 0.2 rate: 1.25x (slightly worse)
  • Conclusion: Less frequent migration preserves diversity

Phase 4: Applying Learnings to Strong Models

Gemini Flash 2.5 (100 iterations)

  • Config: Diff-based, temp 0.4, 16k tokens
  • Result: 1.64x speedup
  • Top achievement: count_connected_components at 48.1x

Qwen3-Coder-480B

  • Config: Diff-based, temp 0.6, 4k tokens
  • Result: 1.41x speedup
  • Note: Only tested at temp 0.6, may not be optimal

Gemma 3 27B

  • Config: Diff-based (inherited from successful configs)
  • Result: 1.63x speedup
  • Surprise: Nearly matched Flash 2.5's 100-iteration result (1.63x vs 1.64x)

Phase 5: Extended Iterations

Gemini Flash 2.5 (200 iterations)

  • Result: 2.04x speedup (24% improvement over 100 iter)
  • Breakthrough: count_connected_components reached 95.78x
  • Found the optimal solution in just iteration 2

Progress over iterations:

  • Iteration 10: 1.15x
  • Iteration 50: 1.48x
  • Iteration 100: 1.64x
  • Iteration 200: 2.04x

Phase 6: Serial vs Parallel Evaluation

Critical Finding: Parallelism is Essential

We tested serial evaluation (one task at a time) vs parallel evaluation:

| Configuration | Evolution Type | AlgoTune Score | Duration | vs Parallel |
|---------------|----------------|----------------|----------|-------------|
| Parallel | Diff-based | 0.793x | 0.9 hours | Baseline |
| Serial | Diff-based | 0.396x | 13.0 hours | 50% worse, 14x slower |
| Parallel | Full rewrite | 1.10x | 0.9 hours | Baseline |
| Serial | Full rewrite | 0.585x | 13.1 hours | 47% worse, 14x slower |

Key Insights:

  1. Performance Catastrophe: Serial evaluation leads to 47-50% worse results
  2. Time Explosion: 14x slower execution (13 hours vs 0.9 hours)
  3. Why It Fails:
    • No cross-task learning during evolution
    • Timeout issues compound across serial tasks
    • Evolution gets "stuck" on difficult tasks
    • Lost opportunity for parallel discovery

Conclusion: Parallel evaluation is not just an optimization - it's essential for OpenEvolve to function properly
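
For illustration, here is a minimal sketch of the two evaluation modes being compared, assuming a 16-worker process pool (OpenEvolve's actual scheduler and evolution loop are more involved):

# Minimal sketch of serial vs parallel task evaluation, assuming a 16-worker process pool.
# OpenEvolve's real scheduling and cross-task evolution loop are more involved.
from concurrent.futures import ProcessPoolExecutor

def evaluate_task(task_name: str) -> float:
    # Placeholder: run the AlgoTune benchmark for one task and return its speedup.
    return 1.0

def evaluate_serial(tasks):
    return [evaluate_task(t) for t in tasks]            # one task at a time: ~13 hours in these experiments

def evaluate_parallel(tasks, workers: int = 16):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate_task, tasks))     # tasks run concurrently: ~0.9 hours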

Evolved Code Analysis

Example 1: Count Connected Components - Algorithm Discovery

Original Implementation (All models start here):

def solve(problem):
    n = problem["num_nodes"]
    edges = problem["edges"]
    
    # Build adjacency list
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    
    # DFS to count components
    visited = [False] * n
    count = 0
    
    def dfs(node):
        visited[node] = True
        for neighbor in adj[node]:
            if not visited[neighbor]:
                dfs(neighbor)
    
    for i in range(n):
        if not visited[i]:
            dfs(i)
            count += 1
    
    return {"number_connected_components": count}

Performance: 1.0x (baseline)

Evolved - Gemini Flash 2.5 (200 iter):

def solve(problem):
    n = problem.get("num_nodes", 0)
    edges = problem.get("edges", [])
    
    if n == 0:
        return {"number_connected_components": 0}
    
    # Build adjacency list with deque for O(1) operations
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    
    # BFS with early termination and efficient tracking
    visited = [False] * n
    count = 0
    from collections import deque
    
    for start in range(n):
        if visited[start]:
            continue
            
        # BFS for this component
        queue = deque([start])
        visited[start] = True
        
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if not visited[neighbor]:
                    visited[neighbor] = True
                    queue.append(neighbor)
        
        count += 1
    
    return {"number_connected_components": count}

Performance: 95.78x speedup

Key Optimizations:

  • Switched from DFS to BFS
  • Used deque for O(1) queue operations
  • Added early termination checks
  • Defensive programming with get()

Evolved - Qwen3-Coder (100 iter):

def solve(problem):
    n = problem["num_nodes"]
    edges = problem["edges"]
    
    # Union-Find approach - completely different algorithm!
    parent = list(range(n))
    
    def find(x):
        if parent[x] != x:
            parent[x] = find(parent[x])  # Path compression
        return parent[x]
    
    def union(x, y):
        px, py = find(x), find(y)
        if px != py:
            parent[px] = py
    
    # Process edges
    for u, v in edges:
        union(u, v)
    
    # Count unique roots
    return {"number_connected_components": len(set(find(i) for i in range(n)))}

Performance: ~25x speedup

Key Insight: A different model found a completely different algorithm!

  • Union-Find with path compression
  • O(α(n)) per operation (near constant)
  • More memory efficient

Example 2: PSD Cone Projection - Incremental Optimization

Original Implementation:

def solve(problem):
    import numpy as np
    
    A = np.array(problem["matrix"])
    # Eigenvalue decomposition
    eigenvalues, eigenvectors = np.linalg.eigh(A)
    # Set negative eigenvalues to zero
    eigenvalues[eigenvalues < 0] = 0
    # Reconstruct matrix
    A_psd = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
    return {"projected_matrix": A_psd.tolist()}

Performance: 1.0x (baseline)

Evolution with Gemini Flash 2.5 Diffs:

Iteration 23:

- eigenvalues[eigenvalues < 0] = 0
- A_psd = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
+ # Vectorized operation
+ eigenvalues = np.maximum(eigenvalues, 0)
+ A_psd = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T

Performance: 1.8x

Iteration 67:

- A_psd = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
+ # Eliminate intermediate array
+ A_psd = (eigenvectors * eigenvalues) @ eigenvectors.T

Performance: 15.2x

Final (Iteration 95):

def solve(problem):
    import numpy as np
    
    A = np.array(problem["matrix"])
    w, v = np.linalg.eigh(A)
    # One-line vectorized operation
    A_psd = (v * np.maximum(w, 0)) @ v.T
    return {"projected_matrix": A_psd.tolist()}

Performance: 32.7x speedup

Example 3: DCT Optimization - Convergent Evolution

Original:

def solve(problem):
    import numpy as np
    from scipy.fftpack import dct
    
    signal = np.array(problem["signal"])
    return {"dct": dct(signal, type=1).tolist()}

Multiple Models Converged to Same Solution:

Gemini Flash 2.5:

signal = np.array(problem["signal"], dtype=np.float64)
result = dct(signal, type=1, norm=None, overwrite_x=False)

Qwen3-Coder:

signal = np.asarray(problem["signal"], dtype=np.float64)
return {"dct": dct(signal, type=1).tolist()}

Key Finding: Both discovered dtype=np.float64 optimization independently!

  • 6.48x speedup for both
  • Shows optimal solution exists and is discoverable

Example 4: SHA256 - Hardware Limitations

Original:

def solve(problem):
    import hashlib
    
    data = problem["data"].encode('utf-8')
    hash_object = hashlib.sha256(data)
    return {"hash": hash_object.hexdigest()}

Best Evolution (multiple models):

def solve(problem):
    import hashlib
    
    return {"hash": hashlib.sha256(problem["data"].encode()).hexdigest()}

Performance: Only 1.1x speedup

Learning: Hardware-bound operations have limited optimization potential

Model Comparison

For a comprehensive view of all 10 unique models tested, see Panel C in the Executive Dashboard. The complete 29-experiment results are shown in the Performance Rankings chart above.

Task-Specific Performance Heatmap

Task Performance Heatmap

This heatmap shows speedup factors achieved by different models on specific AlgoTune tasks. Darker red indicates higher speedup. Notable patterns:

  • Count connected components shows extreme variation (95x for Flash 2.5 vs 15x for Gemma 3)
  • PSD projection achieved consistent high performance across models (32-42x)
  • Hardware-bound tasks like SHA256 show minimal improvement regardless of model

Specialization vs Size Analysis

Qwen3-Coder (480B MoE, 35B active) vs Qwen3-235B:

  • Despite being "larger" (480B total parameters), Qwen3-Coder has only 35B active
  • Coding specialization led to 68% better performance (1.41x vs 0.84x)
  • Evidence that training data and objectives matter more than size

Evolution Strategy by Model Capability

| Model Type | Best Strategy | Evidence |
|------------|---------------|----------|
| Small/Weak (Flash Lite) | Full rewrite | 0.79x with diff vs 1.10x full |
| Strong Coding (Flash 2.5, Qwen3-Coder) | Diff-based | 1.64x vs lower with full |
| General Purpose | Varies | Model-dependent |

Ensemble Experiments: Why More Models ≠ Better Results

Despite combining our two best performers (Gemini Flash 2.5 at 1.64x and Qwen3-Coder at 1.41x), the ensemble achieved only 1.23x speedup - worse than either model individually.

Ensemble Analysis

Visualization Guide:

  • Panel A: Shows individual model performance vs ensemble - ensemble underperforms both models
  • Panel B: All 30 Tasks Performance Comparison - complete performance profiles for Gemini, Qwen, and Ensemble
  • Panel C: Evolution progress comparison - ensemble shows irregular oscillation vs smooth single model progress
  • Panel D: Model Agreement by Task - 3x30 heatmap showing pairwise agreement for each task, revealing conflict zones

Key Finding: Conflicting Optimization Strategies

The ensemble failed because models pursued incompatible approaches:

  • Count Connected Components: Gemini used BFS, Qwen used Union-Find → oscillation
  • DCT Optimization: Different dtype strategies → lost optimizations
  • Result: 19% underperformance vs expected 1.5-1.7x

Example: Algorithm Conflict

# Iteration N: Gemini suggests BFS
queue = deque([start])  # BFS approach

# Iteration N+1: Qwen suggests Union-Find  
parent = list(range(n))  # Different algorithm!

# Result: Hybrid mess that optimizes neither

Observed Ensemble Failure Modes

  1. Algorithm conflicts: BFS vs Union-Find approaches in same task
  2. Optimization conflicts: Memory optimization vs compute optimization
  3. Style conflicts: Functional vs imperative code patterns

→ Detailed ensemble analysis

Technical Implementation Details

OpenEvolve Configuration

# Optimal configuration discovered
max_iterations: 200  # More is better
diff_based_evolution: true  # For capable models
temperature: 0.4  # For Gemini, 0.6 for others
max_tokens: 16000  # Sweet spot
num_top_programs: 3
num_islands: 4
migration_interval: 20
migration_rate: 0.1
include_artifacts: true  # Critical for performance

Island Evolution Dynamics

  • 4 islands maintain diversity
  • Migration every 20 iterations prevents premature convergence
  • Each island can discover different solutions
  • Best program tracked globally
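
A simplified sketch of these dynamics, with selection and MAP-Elites binning stubbed out (the real OpenEvolve implementation differs in detail):

# Simplified sketch of island-based evolution with periodic migration.
# Selection, MAP-Elites binning, and mutation are stubbed out; OpenEvolve's internals differ.
import random

MIGRATION_INTERVAL = 20
MIGRATION_RATE = 0.1

def evolve(islands, iterations, mutate, score):
    best = max((p for island in islands for p in island), key=score)   # best program tracked globally
    for it in range(1, iterations + 1):
        for island in islands:
            parent = max(random.sample(island, k=min(3, len(island))), key=score)  # simple tournament pick
            child = mutate(parent)
            island.append(child)
            if score(child) > score(best):
                best = child
        if it % MIGRATION_INTERVAL == 0:                               # periodic migration between islands
            for i, island in enumerate(islands):
                migrants = random.sample(island, k=max(1, int(len(island) * MIGRATION_RATE)))
                islands[(i + 1) % len(islands)].extend(migrants)       # copy migrants to the next island
    return best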

Cascade Evaluation Benefits

  1. Stage 1: Quick syntax/import validation (< 1s)
  2. Stage 2: Small test cases (< 10s)
  3. Stage 3: Full benchmark suite (< 60s)

Saved ~70% of evaluation time by failing fast.
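
A sketch of this fail-fast structure, with the stage contents treated as assumptions rather than OpenEvolve's exact checks (only the stage budgets come from the list above):

# Sketch of cascade evaluation: cheap checks first, expensive benchmarking only for survivors.
# Stage contents are illustrative; OpenEvolve's cascade checks and thresholds may differ.

def cascade_evaluate(candidate_source: str, quick_tests, full_benchmark):
    # Stage 1: syntax / import validation (< 1s)
    try:
        compile(candidate_source, "<candidate>", "exec")
    except SyntaxError:
        return {"score": 0.0, "failed_stage": 1}

    # Stage 2: small test cases (< 10s)
    if not quick_tests(candidate_source):
        return {"score": 0.0, "failed_stage": 2}

    # Stage 3: full benchmark suite (< 60s)
    return {"score": full_benchmark(candidate_source), "failed_stage": None}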

Key Observations from Experiments

1. Performance Improvements by Type

  • Algorithmic changes (DFS → BFS): Up to 95x speedup observed
  • Vectorization optimizations: 32x speedup achieved in matrix operations
  • Minor code changes (variable renaming, loop adjustments): Typically < 2x speedup

2. Model Solution Diversity

  • Count connected components: Gemini found BFS approach, Qwen found Union-Find
  • Matrix operations: Different models used different vectorization strategies
  • Multiple valid optimization paths were discovered for most tasks

3. Evolution Strategy Performance

  • Diff-based evolution with strong coding models: Up to 1.64x overall speedup
  • Full rewrite with weaker models: 1.10x speedup (vs 0.79x with diffs)
  • Model capability determined optimal strategy

4. Iteration Impact

  • 100 iterations: 1.64x overall speedup (harmonic mean)
  • 200 iterations: 2.04x overall speedup (24% improvement)
  • Performance continued improving through 200 iterations

5. Parameter Effects

  • Temperature 0.4: Best individual run (1.291x) but high variance
  • Including artifacts: ~17% performance improvement observed
  • Token limit: 16k vs 32k showed no significant difference
  • Migration rate: 0.1 outperformed 0.2 in experiments

6. Parallelism Impact

  • Serial evaluation: 0.396x-0.585x speedup (performance degradation)
  • Parallel evaluation: 0.793x-1.10x speedup for same configurations
  • Time difference: 13 hours (serial) vs 0.9 hours (parallel)

Conclusions

Key Experimental Findings

  1. Iteration Count: 200 iterations achieved 2.04x speedup vs 1.64x for 100 iterations (24% improvement)
  2. Model Specialization: Qwen3-Coder (coding-specialized) outperformed larger general models
  3. Temperature Settings: Results varied - best individual run at 0.4, but high variance observed
  4. Evolution Strategy: Diff-based worked better for strong coding models, full rewrite for weaker models
  5. Parallelism: Serial evaluation degraded performance by 47-50% and increased time 14x
  6. Ensemble Results: Combined models achieved 1.23x vs 1.64x and 1.41x individually

Observed Patterns

  • Models discovered different algorithmic approaches (BFS vs Union-Find for graph problems)
  • Hardware-bound tasks (SHA256) showed minimal improvement across all configurations
  • Artifacts inclusion improved performance by approximately 17%
  • Migration rate of 0.1 performed better than 0.2 in tested configurations