A comprehensive analysis of 29 experiments using OpenEvolve, an open-source evolutionary coding agent, on the AlgoTune benchmark suite reveals that both proprietary and open models achieve significant performance gains. Notably, open models like Google's Gemma 3 27B and Alibaba's Qwen3-Coder 480B demonstrated impressive optimization capabilities, rivalling, and sometimes uniquely surpassing their closed-source counterparts. Gemini Flash 2.5 achieved the top speedup of 2.04x, with Gemma 3 27B and Qwen3-Coder reaching 1.63x and 1.41x, respectively.
Dashboard Overview: This 2x2 visualization summarizes key experimental findings:
Model | Config | AlgoTune Score | Best Task |
---|---|---|---|
Gemini Flash 2.5 | 200 iter, diff, temp 0.4 | 2.04x | count_connected_components: 95.78x |
Gemini Flash 2.5 | 100 iter, diff, temp 0.4 | 1.64x | psd_cone_projection: 32.7x |
Gemma 3 27B | 100 iter, diff | 1.63x | psd_cone_projection: 41.1x |
Qwen3-Coder 480B | 100 iter, diff, temp 0.6 | 1.41x | psd_cone_projection: 41.9x |
Gemini Family
Qwen Family
Other Models
algotune_to_openevolve.py
AlgoTune uses a logarithmic performance scale that gets converted to speedup factors:
# AlgoTune score calculation
speedup = (10 ** (performance * 3)) - 1 # Convert from log scale
algotune_score = harmonic_mean(all_speedups) # Overall benchmark score
The harmonic mean emphasizes consistent performance across all tasks - a model must perform well on all 30 benchmarks to achieve a high score.
Chart Explanation: Horizontal bars show AlgoTune scores (harmonic mean of speedups) for all experiments. Green bars indicate good performance (>1.5x), blue for moderate (1.2-1.5x), orange for modest (1.0-1.2x), and red for poor (<1.0x). The dashed line at 1.0x represents baseline performance.
Comparing diff-based vs full rewrite on Flash Lite:
Evidence of Struggle with Diffs:
# Attempted diff that failed
- for i in range(n):
- if not visited[i]:
- dfs(i)
+ for i in range(n):
+ if not visited[i]:
+ # Incomplete optimization
+ dfs(i) # No actual improvement
Systematic testing on Gemini Flash 2.5 Lite with 16k tokens:
Temperature | AlgoTune Score | Avg Performance | Duration | Key Finding |
---|---|---|---|---|
0.2 | 1.17x | 0.162 | 4074s | Conservative but stable |
0.4 | 1.29x | 0.175 | 3784s | Best performance achieved |
0.8 | 1.02x | 0.159 | 3309s | Too random, performance drops |
Statistical Significance: 10.3% improvement from 0.2→0.4, 20.9% degradation from 0.4→0.8
All experiments at optimal temp 0.4:
Note: "Artifacts" in OpenEvolve are debugging outputs that programs can return during evaluation, helping the LLM understand execution behavior and make better optimization decisions.
Progress over iterations:
We tested serial evaluation (one task at a time) vs parallel evaluation:
Configuration | Evolution Type | AlgoTune Score | Duration | vs Parallel |
---|---|---|---|---|
Parallel | Diff-based | 0.793x | 0.9 hours | Baseline |
Serial | Diff-based | 0.396x | 13.0 hours | 50% worse, 14x slower |
Parallel | Full rewrite | 1.10x | 0.9 hours | Baseline |
Serial | Full rewrite | 0.585x | 13.1 hours | 47% worse, 14x slower |
Key Insights:
Conclusion: Parallel evaluation is not just an optimization - it's essential for OpenEvolve to function properly
Original Implementation (All models start here):
def solve(problem):
n = problem["num_nodes"]
edges = problem["edges"]
# Build adjacency list
adj = [[] for _ in range(n)]
for u, v in edges:
adj[u].append(v)
adj[v].append(u)
# DFS to count components
visited = [False] * n
count = 0
def dfs(node):
visited[node] = True
for neighbor in adj[node]:
if not visited[neighbor]:
dfs(neighbor)
for i in range(n):
if not visited[i]:
dfs(i)
count += 1
return {"number_connected_components": count}
Performance: 1.0x (baseline)
Evolved - Gemini Flash 2.5 (200 iter):
def solve(problem):
n = problem.get("num_nodes", 0)
edges = problem.get("edges", [])
if n == 0:
return {"number_connected_components": 0}
# Build adjacency list with deque for O(1) operations
adj = [[] for _ in range(n)]
for u, v in edges:
adj[u].append(v)
adj[v].append(u)
# BFS with early termination and efficient tracking
visited = [False] * n
count = 0
from collections import deque
for start in range(n):
if visited[start]:
continue
# BFS for this component
queue = deque([start])
visited[start] = True
while queue:
node = queue.popleft()
for neighbor in adj[node]:
if not visited[neighbor]:
visited[neighbor] = True
queue.append(neighbor)
count += 1
return {"number_connected_components": count}
Performance: 95.78x speedup Key Optimizations:
Evolved - Qwen3-Coder (100 iter):
def solve(problem):
n = problem["num_nodes"]
edges = problem["edges"]
# Union-Find approach - completely different algorithm!
parent = list(range(n))
def find(x):
if parent[x] != x:
parent[x] = find(parent[x]) # Path compression
return parent[x]
def union(x, y):
px, py = find(x), find(y)
if px != py:
parent[px] = py
# Process edges
for u, v in edges:
union(u, v)
# Count unique roots
return {"number_connected_components": len(set(find(i) for i in range(n)))}
Performance: ~25x speedup Key Insight: Different model found completely different algorithm!
Original Implementation:
def solve(problem):
import numpy as np
A = np.array(problem["matrix"])
# Eigenvalue decomposition
eigenvalues, eigenvectors = np.linalg.eigh(A)
# Set negative eigenvalues to zero
eigenvalues[eigenvalues < 0] = 0
# Reconstruct matrix
A_psd = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
return {"projected_matrix": A_psd.tolist()}
Performance: 1.0x (baseline)
Evolution with Gemini Flash 2.5 Diffs:
Iteration 23:
- eigenvalues[eigenvalues < 0] = 0
- A_psd = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
+ # Vectorized operation
+ eigenvalues = np.maximum(eigenvalues, 0)
+ A_psd = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
Performance: 1.8x
Iteration 67:
- A_psd = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
+ # Eliminate intermediate array
+ A_psd = (eigenvectors * eigenvalues) @ eigenvectors.T
Performance: 15.2x
Final (Iteration 95):
def solve(problem):
import numpy as np
A = np.array(problem["matrix"])
w, v = np.linalg.eigh(A)
# One-line vectorized operation
A_psd = (v * np.maximum(w, 0)) @ v.T
return {"projected_matrix": A_psd.tolist()}
Performance: 32.7x speedup
Original:
def solve(problem):
import numpy as np
from scipy.fftpack import dct
signal = np.array(problem["signal"])
return {"dct": dct(signal, type=1).tolist()}
Multiple Models Converged to Same Solution:
Gemini Flash 2.5:
signal = np.array(problem["signal"], dtype=np.float64)
result = dct(signal, type=1, norm=None, overwrite_x=False)
Qwen3-Coder:
signal = np.asarray(problem["signal"], dtype=np.float64)
return {"dct": dct(signal, type=1).tolist()}
Key Finding: Both discovered dtype=np.float64 optimization independently!
Original:
def solve(problem):
import hashlib
data = problem["data"].encode('utf-8')
hash_object = hashlib.sha256(data)
return {"hash": hash_object.hexdigest()}
Best Evolution (multiple models):
def solve(problem):
import hashlib
return {"hash": hashlib.sha256(problem["data"].encode()).hexdigest()}
Performance: Only 1.1x speedup Learning: Hardware-bound operations have limited optimization potential
For a comprehensive view of all 10 unique models tested, see Panel C in the Executive Dashboard. The complete 29-experiment results are shown in the Performance Rankings chart above.
This heatmap shows speedup factors achieved by different models on specific AlgoTune tasks. Darker red indicates higher speedup. Notable patterns:
Qwen3-Coder (480B MoE, 35B active) vs Qwen3-235B:
| Model Type | Best Strategy | Evidence | |------------|---------------|----------| | Small/Weak (Flash Lite) | Full rewrite | 0.79x with diff vs 1.10x full | | Strong Coding (Flash 2.5, Qwen3-Coder) | Diff-based | 1.64x vs lower with full | | General Purpose | Varies | Model-dependent |
Despite combining our two best performers (Gemini Flash 2.5 at 1.64x and Qwen3-Coder at 1.41x), the ensemble achieved only 1.23x speedup - worse than either model individually.
Visualization Guide:
The ensemble failed because models pursued incompatible approaches:
# Iteration N: Gemini suggests BFS
queue = deque([start]) # BFS approach
# Iteration N+1: Qwen suggests Union-Find
parent = list(range(n)) # Different algorithm!
# Result: Hybrid mess that optimizes neither
# Optimal configuration discovered
max_iterations: 200 # More is better
diff_based_evolution: true # For capable models
temperature: 0.4 # For Gemini, 0.6 for others
max_tokens: 16000 # Sweet spot
num_top_programs: 3
num_islands: 4
migration_interval: 20
migration_rate: 0.1
include_artifacts: true # Critical for performance
Saved ~70% of evaluation time by failing fast.