Ensemble Analysis: Why More Models ≠ Better Results

Key Results

  • Individual Performance: Gemini Flash 2.5: 1.64x, Qwen3-Coder: 1.41x
  • Ensemble Performance: 1.23x (lower than either individual model)
  • Performance Drop: 25% below Gemini Flash 2.5, 13% below Qwen3-Coder
  • Primary Issue: Models pursued conflicting optimization strategies

Executive Summary

Despite combining our two best diff-based performers - Gemini Flash 2.5 (1.64x) and Qwen3-Coder (1.41x) - with a 60/40 weight split, the ensemble achieved only 1.23x speedup. This shortfall (25% below Gemini Flash 2.5, 13% below Qwen3-Coder) reveals a critical limitation: when models have fundamentally different optimization approaches, they interfere with rather than complement each other.

Ensemble Analysis

Experiment Details

Configuration

models:
  - name: "google/gemini-2.5-flash"
    weight: 0.6  # Best performer (1.520x on 100 iter)
  - name: "qwen/qwen3-coder"  
    weight: 0.4  # Strong diff performer (1.414x)
    
temperature: 0.4      # Optimal from experiments
max_tokens: 16000     # Optimal context
diff_based_evolution: true  # Both excel with diffs

Results Summary

  • Overall Score: 1.226x (harmonic mean across tasks; see the sketch after this list)
  • Best Task: psd_cone_projection at 36.0x
  • Worst Underperformance: count_connected_components at 18.5x (vs 48.1x Gemini, 25.0x Qwen)
  • Duration: 3.4 hours
  • Success Rate: 100% (but suboptimal performance)
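
A minimal sketch of the aggregate metric, assuming the overall score is a harmonic mean of per-task speedups (the full benchmark includes more tasks than the handful quoted in this document, so the values below alone will not reproduce 1.226x):

from statistics import harmonic_mean

# Harmonic mean of per-task speedups: low-speedup tasks dominate the
# aggregate, so a few large wins cannot mask widespread mediocrity.
def overall_score(task_speedups):
    return harmonic_mean(task_speedups.values())

print(round(overall_score({
    "count_connected_components": 18.5,
    "psd_cone_projection": 36.0,
    "dct_type_I": 3.01,
}), 3))  # mean over just these three tasks; the full suite yields 1.226x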

Why the Ensemble Failed

1. Conflicting Algorithm Paradigms

The most striking example is count_connected_components, where models pursued incompatible approaches:

Gemini Flash 2.5 Approach (BFS)

# Iteration 23: Gemini suggests BFS optimization
# (input format is an assumption here: problem = {"num_nodes": n, "edges": [(u, v), ...]})
def solve(problem):
    from collections import deque

    n = problem["num_nodes"]
    adj = [[] for _ in range(n)]
    for u, v in problem["edges"]:  # undirected graph: add both directions
        adj[u].append(v)
        adj[v].append(u)

    visited = [False] * n
    components = 0
    for start in range(n):
        if visited[start]:
            continue
        components += 1
        queue = deque([start])  # BFS approach
        visited[start] = True
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if not visited[neighbor]:
                    visited[neighbor] = True
                    queue.append(neighbor)
    return components

Qwen3-Coder Approach (Union-Find)

# Iteration 24: Qwen suggests Union-Find
# (same assumed input format as above)
def solve(problem):
    n = problem["num_nodes"]
    parent = list(range(n))

    def find(x):
        if parent[x] != x:
            parent[x] = find(parent[x])  # Path compression
        return parent[x]

    def union(x, y):
        px, py = find(x), find(y)
        if px != py:
            parent[px] = py

    for u, v in problem["edges"]:
        union(u, v)
    # Number of distinct roots = number of connected components
    return len({find(x) for x in range(n)})
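
Both algorithms are near-linear for this task (BFS runs in O(V + E); Union-Find with path compression is near-linear in practice), so neither choice is wrong in isolation. The conflict is that diff-based edits toward one data structure actively dismantle the other, as the next iteration shows.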

Result: Oscillation and Compromise

# Iteration 25: Ensemble produces hybrid mess
def solve(problem):
    from collections import deque
    parent = list(range(n))  # Union-Find structure
    visited = [False] * n     # BFS structure
    
    # Incomplete BFS that also tries to maintain parent pointers
    # Neither approach is properly optimized!

2. Evolution Oscillation Pattern

The ensemble exhibited a characteristic oscillation pattern:

  1. Iterations 1-20: Gemini's BFS approach dominates (60% weight)
  2. Iterations 21-40: Qwen's optimizations partially revert BFS, start Union-Find
  3. Iterations 41-60: Conflict phase - neither approach fully implemented
  4. Iterations 61-100: Stagnation with hybrid non-optimal solution

This is clearly visible in the evolution trajectory visualization where the ensemble plateaus at 1.23x while individual models continue improving.

3. Task-Specific Analysis

| Task | Gemini Solo | Qwen Solo | Ensemble | Loss |
|------|-------------|-----------|----------|------|
| count_connected_components | 48.1x | 25.0x | 18.5x | -62% from best |
| psd_cone_projection | 32.7x | 41.9x | 36.0x | -14% from best |
| dct_type_I | 6.48x | 6.48x | 3.01x | -54% from best |
| matrix_multiplication | 2.20x | 2.15x | 2.18x | -1% (minimal loss) |
| sha256_hashing | 1.10x | 1.12x | 3.51x | +214% (anomaly) |

Pattern: Tasks with similar optimization approaches (matrix_multiplication) showed minimal loss, while tasks with different algorithmic solutions (count_components, dct) showed severe degradation.

4. Specific Examples of Interference

Example 1: DCT Optimization Conflict

# Gemini's approach (Iteration 35)
signal = np.array(problem["signal"], dtype=np.float64)  # Type optimization

# Qwen's approach (Iteration 36)  
signal = np.asarray(problem["signal"])  # Different array creation
result = dct(signal, type=1, norm='ortho')  # Added normalization

# Ensemble result (Iteration 37)
signal = np.array(problem["signal"])  # Lost dtype optimization!
result = dct(signal, type=1)  # Lost normalization!

Example 2: Lost Optimizations in PSD Projection

# Assumes w (eigenvalues) and v (eigenvectors) come from np.linalg.eigh(A)

# Gemini discovered (Iteration 67):
A_psd = (v * np.maximum(w, 0)) @ v.T  # Broadcasted, vectorized operation

# Qwen tried (Iteration 68):
w_positive = np.where(w > 0, w, 0)  # Different clipping approach
A_psd = v @ np.diag(w_positive) @ v.T  # Materializes a full diagonal matrix

# Ensemble compromised (Iteration 69):
eigenvalues[eigenvalues < 0] = 0  # Back to the original baseline!
A_psd = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
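
All three variants compute the same projection: (v * np.maximum(w, 0)) @ v.T is algebraically identical to v @ np.diag(np.maximum(w, 0)) @ v.T, but the broadcasted form never materializes the n×n diagonal matrix. Reverting to the np.diag baseline therefore discarded a pure constant-factor win without changing the result.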

5. Model Agreement Analysis

We analyzed where the models agreed versus disagreed (one simple way to approximate such an agreement score is sketched after the lists below):

High Agreement Tasks (helped ensemble):

  • sha256_hashing: Both models found similar minimal optimizations
  • matrix_multiplication: Both used similar numpy optimizations
  • linear_system_solver: Both improved memory allocation similarly

Low Agreement Tasks (hurt ensemble):

  • count_connected_components: BFS vs Union-Find
  • edge_expansion: Different graph representations
  • l0_pruning: Sorting vs heap approaches
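
A hypothetical sketch of how per-task agreement could be approximated, using plain text similarity between the two models' proposed solutions (an illustration, not necessarily the method used for the analysis above):

import difflib

def agreement(code_a: str, code_b: str) -> float:
    # Similarity ratio in [0, 1] between two proposed solutions.
    return difflib.SequenceMatcher(None, code_a, code_b).ratio()

# High ratio: models pulling in the same direction (e.g. both tweak numpy calls).
# Low ratio: likely algorithmic conflict (e.g. BFS vs Union-Find rewrites).
bfs_version = "queue = deque([start])\nwhile queue: ..."
union_find_version = "parent = list(range(n))\ndef find(x): ..."
print(agreement(bfs_version, union_find_version))  # low score flags a conflict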

Analysis of Ensemble Behavior

Observed Success Patterns

The ensemble performed better when:

  • Models suggested similar optimization approaches (e.g., both used numpy optimizations)
  • Tasks had limited optimization paths (e.g., SHA256 hashing)
  • Changes were syntactic rather than algorithmic

Observed Failure Patterns

The ensemble performed worse when:

  • Models chose fundamentally different algorithms (BFS vs Union-Find observed)
  • One model's optimization was reverted by another
  • Models alternated between memory vs compute optimizations

Actual Ensemble Configurations Tested

  1. 60/40 Weight Split: Gemini Flash 2.5 (60%) + Qwen3-Coder (40%); weight-proportional selection is sketched after this list
    • Result: 1.226x speedup (worse than either model alone)
  2. Migration Settings: Standard 4 islands with a 0.1 migration rate
    • Islands did not prevent optimization conflicts
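
A minimal sketch of weight-proportional model selection, assuming the ensemble samples one model per evolution iteration according to the configured weights (model names and weights are taken from the configuration above):

import random

MODELS = [("google/gemini-2.5-flash", 0.6), ("qwen/qwen3-coder", 0.4)]

def pick_model(rng: random.Random) -> str:
    # Draw one model per iteration, proportional to its configured weight.
    names, weights = zip(*MODELS)
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(42)
picks = [pick_model(rng) for _ in range(100)]
print(picks.count("google/gemini-2.5-flash"))  # roughly 60 of 100 iterations

Because consecutive iterations can draw different models, an edit from one model is frequently followed by an edit from the other, which is exactly where the conflicting strategies collide.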

Evidence from Actual Runs

Oscillation in count_connected_components

From the ensemble experiment results for count_connected_components (a helper for spotting the regression in this trajectory is sketched after the list):

  • Iteration 10: Performance 0.178 (BFS partially implemented)
  • Iteration 25: Performance 0.221 (Union-Find started)
  • Iteration 40: Performance 0.195 (Hybrid mess)
  • Iteration 55: Performance 0.203 (Minor recovery)
  • Final: Performance 0.405 (18.5x speedup vs potential 48.1x)
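
A small sketch for flagging non-monotonic progress in a score trajectory, using the checkpoint values reported above:

# Checkpoint scores from the run above (iterations 10, 25, 40, 55, final)
trajectory = [0.178, 0.221, 0.195, 0.203, 0.405]

def count_regressions(scores):
    # Count checkpoints where the score fell below the previous one;
    # steady convergence should show zero such drops.
    return sum(1 for prev, cur in zip(scores, scores[1:]) if cur < prev)

print(count_regressions(trajectory))  # 1: the drop between iterations 25 and 40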

Success in sha256_hashing

Both models agreed on minimal optimizations:

  • Remove intermediate variables
  • Combine operations into one line
  • Both recognized hardware limitations

Result: 3.51x speedup (anomalously high, likely measurement variance)
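
For illustration, the kind of minimal change both models converged on might look like this (hypothetical code; it assumes the task hands the solver a raw-bytes payload under problem["plaintext"]):

import hashlib

# Before: intermediate variables
def solve(problem):
    data = problem["plaintext"]
    hasher = hashlib.sha256()
    hasher.update(data)
    return hasher.digest()

# After: operations combined into one line; hashing is library/hardware
# bound, so little more than this is available to either model.
def solve(problem):
    return hashlib.sha256(problem["plaintext"]).digest()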

Summary of Findings

The ensemble experiment tested combining two high-performing models (Gemini Flash 2.5 at 1.64x and Qwen3-Coder at 1.41x) with a 60/40 weight distribution. The resulting performance of 1.23x was lower than either individual model.

Key observations:

  • Models discovered different algorithmic solutions for the same problems
  • Evolution oscillated between approaches rather than converging
  • Performance degraded most on tasks where models had conflicting strategies
  • Tasks with limited optimization paths showed less degradation

The experiment demonstrates that model ensemble performance in code evolution depends on alignment of optimization approaches rather than individual model strength.
