Despite combining our two best diff-based performers, Gemini Flash 2.5 (1.64x) and Qwen3-Coder (1.41x), with a 60/40 weight split, the ensemble achieved only a 1.23x speedup. This roughly 19% shortfall relative to Gemini's solo 1.52x at the same 100-iteration budget reveals a critical limitation: when models take fundamentally different optimization approaches, they interfere with rather than complement each other.
```yaml
models:
  - name: "google/gemini-2.5-flash"
    weight: 0.6   # Best performer (1.520x on 100 iter)
  - name: "qwen/qwen3-coder"
    weight: 0.4   # Strong diff performer (1.414x)
temperature: 0.4            # Optimal from experiments
max_tokens: 16000           # Optimal context
diff_based_evolution: true  # Both excel with diffs
```
The most striking example is count_connected_components, where models pursued incompatible approaches:
```python
# Iteration 23: Gemini suggests BFS optimization
def solve(problem):
    from collections import deque
    # ... adjacency list setup ...
    for start in range(n):
        if visited[start]:
            continue
        queue = deque([start])  # BFS approach
        visited[start] = True
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if not visited[neighbor]:
                    visited[neighbor] = True
                    queue.append(neighbor)
```
```python
# Iteration 24: Qwen suggests Union-Find
def solve(problem):
    parent = list(range(n))

    def find(x):
        if parent[x] != x:
            parent[x] = find(parent[x])  # Path compression
        return parent[x]

    def union(x, y):
        px, py = find(x), find(y)
        if px != py:
            parent[px] = py
```
```python
# Iteration 25: Ensemble produces a hybrid mess
def solve(problem):
    from collections import deque
    parent = list(range(n))  # Union-Find structure
    visited = [False] * n    # BFS structure
    # Incomplete BFS that also tries to maintain parent pointers
    # Neither approach is properly optimized!
```
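For contrast, a coherent single-strategy solution commits to one data structure. Below is a minimal Union-Find sketch; the `(n, edges)` signature and function name are assumptions for illustration, not the benchmark's actual `problem` format:

```python
def count_components_union_find(n, edges):
    """Count connected components of an n-node graph given as an edge list.

    Sketch only: assumes nodes are 0..n-1 and edges is an iterable of (u, v).
    """
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # Iterative path compression: find the root, then repoint the path
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    components = n
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            # Union by size: attach the smaller tree under the larger
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru
            size[ru] += size[rv]
            components -= 1  # each successful union merges two components
    return components

print(count_components_union_find(5, [(0, 1), (1, 2), (3, 4)]))  # prints 2
```

Either this or the BFS version alone would have been a sound basis for further evolution; the failure mode above comes from mixing the two mid-stream.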
The ensemble exhibited a characteristic oscillation pattern: successive iterations alternated between the two models' incompatible strategies instead of refining either one. This is clearly visible in the evolution trajectory visualization, where the ensemble plateaus at 1.23x while the individual models continue improving.
| Task | Gemini Solo | Qwen Solo | Ensemble | Loss |
|---|---|---|---|---|
| count_connected_components | 48.1x | 25.0x | 18.5x | -62% from best |
| psd_cone_projection | 32.7x | 41.9x | 36.0x | -14% from best |
| dct_type_I | 6.48x | 6.48x | 3.01x | -54% from best |
| matrix_multiplication | 2.20x | 2.15x | 2.18x | -1% (minimal loss) |
| sha256_hashing | 1.10x | 1.12x | 3.51x | +214% (anomaly) |
Pattern: tasks where the two models favored similar optimization approaches (matrix_multiplication) showed minimal loss, while tasks where they converged on different algorithms (count_connected_components, dct_type_I) showed severe degradation.
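The "Loss" column is the ensemble's shortfall relative to the better of the two solo results. A quick check of the arithmetic (the helper name is ours, for illustration):

```python
def loss_vs_best(gemini, qwen, ensemble):
    """Percentage change of the ensemble relative to the better solo model."""
    best = max(gemini, qwen)
    return 100.0 * (ensemble - best) / best

# count_connected_components: ensemble 18.5x vs best solo 48.1x
print(round(loss_vs_best(48.1, 25.0, 18.5)))  # prints -62
```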
```python
# Gemini's approach (Iteration 35)
signal = np.array(problem["signal"], dtype=np.float64)  # Type optimization

# Qwen's approach (Iteration 36)
signal = np.asarray(problem["signal"])      # Different array creation
result = dct(signal, type=1, norm='ortho')  # Added normalization

# Ensemble result (Iteration 37)
signal = np.array(problem["signal"])  # Lost dtype optimization!
result = dct(signal, type=1)          # Lost normalization!
```
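The two improvements are not actually in conflict; a version that kept both would combine the dtype cast with the normalized transform. A sketch using `scipy.fft.dct` (the `problem` dict shape and the function name `dct_type1` are assumptions):

```python
import numpy as np
from scipy.fft import dct

def dct_type1(problem):
    # Keep Gemini's dtype optimization AND Qwen's orthonormal normalization
    signal = np.asarray(problem["signal"], dtype=np.float64)
    return dct(signal, type=1, norm='ortho')

result = dct_type1({"signal": [1.0, 2.0, 3.0, 4.0]})
```

A diff-merging ensemble that understood both edits could have produced this directly; instead it dropped both.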
```python
# Gemini discovered (Iteration 67):
A_psd = (v * np.maximum(w, 0)) @ v.T  # Vectorized operation

# Qwen tried (Iteration 68):
w_positive = np.where(w > 0, w, 0)    # Different approach
A_psd = v @ np.diag(w_positive) @ v.T

# Ensemble compromised (Iteration 69):
eigenvalues[eigenvalues < 0] = 0      # Back to original!
A_psd = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
```
We analyzed where the models agreed versus disagreed:

- High agreement tasks (helped the ensemble): matrix_multiplication and sha256_hashing, where both models converged on similar, minimal optimizations.
- Low agreement tasks (hurt the ensemble): count_connected_components, dct_type_I, and psd_cone_projection, where the models pursued different algorithms.

In short, the ensemble performed better when both models favored the same optimization strategy, and worse when their approaches were algorithmically incompatible.
- 60/40 weight split: Gemini Flash 2.5 (60%) + Qwen3-Coder (40%)
- Migration settings: standard 4 islands with a 0.1 migration rate
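In config terms, the island settings might look like the fragment below. This is a sketch: the `num_islands` and `migration_rate` field names are assumptions modeled on the config style above, not verified keys.

```yaml
num_islands: 4        # standard 4-island population
migration_rate: 0.1   # fraction of programs migrating between islands
```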
The per-task results illustrate both failure modes. On count_connected_components, the ensemble reached only 18.5x against Gemini's solo 48.1x. On sha256_hashing, where both models agreed on minimal optimizations, the ensemble recorded a 3.51x speedup (anomalously high, likely measurement variance).
The ensemble experiment tested combining two high-performing models (Gemini Flash 2.5 at 1.64x and Qwen3-Coder at 1.41x) with a 60/40 weight distribution. The resulting 1.23x speedup was lower than either model achieved alone. The key observation: model ensemble performance in code evolution depends on alignment of optimization approaches, not on individual model strength.