Baseline Experiments Analysis

Overview

Our initial experiments established baselines for different models and discovered key patterns that guided all subsequent optimization efforts.

Key Discoveries

  • First experiments with Gemini Flash Lite and Qwen models established baselines
  • Discovered diff-based evolution advantages for strong coding models

Baseline Experiments

1. Gemini Flash 2.5 Lite - First Test

Configuration: Gemini Flash 2.5 Lite, 1 iteration, full rewrite

  • Purpose: Test single-iteration baseline
  • Result: 0.621x (worse than baseline!)
  • Duration: 0.1 hours
  • Learning: A single iteration is insufficient for any meaningful evolution
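
Note on the metric: results like 0.621x aggregate per-task speedups into a single score, where values below 1.0x are slower than the reference. A minimal sketch, assuming a geometric mean (the exact aggregation used by the harness is an assumption here):

import math

def aggregate_speedup(per_task_speedups):
    """Geometric mean of per-task speedups (assumed aggregation)."""
    logs = [math.log(s) for s in per_task_speedups]
    return math.exp(sum(logs) / len(logs))

# One large win does not offset broad regressions:
print(aggregate_speedup([32.6] + [0.5] * 7))  # ~0.84x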

2. Gemini Flash 2.5 Lite - Full Evolution

Configuration: Gemini Flash 2.5 Lite, 100 iterations, full rewrite

  • Config:
    • Full rewrite mode
    • Temperature: 0.8 (default)
    • Tokens: 4000
    • 100 iterations
  • Result: 1.10x speedup
  • Best task: sha256_hashing (32.6x)
  • Duration: 0.9 hours
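
For concreteness, the settings above map onto a run spec along these lines (field names are illustrative, not the harness's actual schema):

# Illustrative run configuration (hypothetical field names)
config = {
    "model": "gemini-2.5-flash-lite",
    "evolution_mode": "full_rewrite",  # regenerate the whole program each step
    "temperature": 0.8,                # default; later found too high
    "max_tokens": 4000,
    "iterations": 100,
}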

Key Observations:

  • Model achieved modest improvements
  • High temperature (0.8) led to inconsistent results
  • Established baseline for Flash Lite performance

3. Gemini Flash 2.5 Lite - Diff Evolution

Configuration: Gemini Flash 2.5 Lite, 100 iterations, diff-based

  • Config: Same as above but diff-based evolution
  • Result: 0.793x (WORSE than full rewrite!)
  • Duration: 0.9 hours

Critical Discovery: Smaller models struggle with diff-based evolution

Example Failed Diff:

# Attempted optimization that failed
- visited = [False] * n
+ visited = [False] * n  # No actual change
+ # TODO: Optimize this section

The model added comments instead of actual optimizations!
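
To make the failure mode concrete: in diff-based evolution the model emits small search/replace hunks rather than a whole program, and the harness splices them into the current code. A minimal sketch of that splicing step (our simplification, not the harness's actual parser):

def apply_diff(source: str, search: str, replace: str) -> str:
    """Splice one edit hunk into the current program.

    A degenerate hunk like the one above passes this step trivially:
    the program still runs, but nothing was optimized.
    """
    if search not in source:
        raise ValueError("diff does not match current program")
    return source.replace(search, replace, 1)

program = "visited = [False] * n"
program = apply_diff(
    program,
    search="visited = [False] * n",
    replace="visited = [False] * n  # No actual change",
)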

4. Qwen3-235B General Model

Configuration: Qwen3-235B-A22B, 100 iterations, full rewrite

  • Config:
    • Full rewrite mode
    • 100 iterations
    • General-purpose 235B model
  • Result: 0.84x (worse than baseline)
  • Duration: 2.8 hours
  • Best task: toeplitz_solver (44.6x)

Surprising Finding: Despite its 235B parameters, the model performed poorly overall

  • General-purpose training not suitable for code optimization
  • Many invalid programs generated
  • Inconsistent performance across tasks
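
The toeplitz_solver outlier above shows what a genuine algorithmic change looks like: swapping a generic O(n^3) dense solve for the O(n^2) Levinson-Durbin routine SciPy already ships. An illustrative before/after (our reconstruction, not the evolved program itself):

import numpy as np
from scipy.linalg import solve_toeplitz, toeplitz

rng = np.random.default_rng(0)
n = 1000
c = rng.random(n)
c[0] += n                # diagonal dominance keeps the system well-conditioned
b = rng.random(n)

# Baseline: materialize the full matrix and run a generic O(n^3) solve.
x_slow = np.linalg.solve(toeplitz(c), b)

# Optimized: exploit Toeplitz structure via Levinson recursion,
# O(n^2) time, without ever forming the n x n matrix.
x_fast = solve_toeplitz(c, b)

assert np.allclose(x_slow, x_fast)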

5. Qwen3-Coder - Full Rewrite

Configuration: Qwen3-Coder-480B, 100 iterations, full rewrite

  • Config:
    • Full rewrite mode
    • Temperature: 0.6
    • 480B MoE model (35B active)
  • Result: 1.093x
  • Duration: 6.8 hours
  • Best task: matrix_multiplication (31.9x)

Observations:

  • Coding-specialized model showed promise
  • Slower due to model size
  • Better than the general-purpose Qwen3-235B despite a similar active-parameter count (35B vs. 22B)
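
The matrix_multiplication win above follows a familiar pattern: most of the headroom comes from replacing interpreted Python loops with a single BLAS-backed call. A representative before/after (illustrative, not the actual evolved code):

import numpy as np

def matmul_loops(A, B):
    """Naive triple loop: interpreter overhead on every scalar multiply."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

def matmul_blas(A, B):
    """One call into optimized, multithreaded BLAS."""
    return A @ B

A = np.random.rand(64, 64)
B = np.random.rand(64, 64)
assert np.allclose(matmul_loops(A, B), matmul_blas(A, B))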

6. Qwen3-Coder - Diff Evolution

Configuration: Qwen3-Coder-480B, 100 iterations, diff-based

  • Config: Diff-based evolution
  • Result: 1.414x (30% improvement!)
  • Duration: 4.9 hours
  • Best task: psd_cone_projection (29.8x)

Breakthrough Discovery: Strong coding models excel with diffs

Successful Diff Example:

- eigenvalues[eigenvalues < 0] = 0
- A_psd = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
+ # Vectorized operation for efficiency
+ eigenvalues = np.maximum(eigenvalues, 0)
+ A_psd = (eigenvectors * eigenvalues) @ eigenvectors.T

Real optimization with understanding of numpy broadcasting!
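
Put together, the evolved hunk amounts to a projection routine like the one below. The broadcast eigenvectors * eigenvalues scales column i by eigenvalue i in O(n^2), eliminating the extra O(n^3) multiply against the dense matrix that np.diag would build (a minimal reconstruction, assuming a symmetric input):

import numpy as np

def project_to_psd(A):
    """Project a symmetric matrix onto the PSD cone."""
    eigenvalues, eigenvectors = np.linalg.eigh(A)
    eigenvalues = np.maximum(eigenvalues, 0)   # clamp negative eigenvalues
    # Broadcasting scales each eigenvector column by its eigenvalue,
    # replacing eigenvectors @ np.diag(eigenvalues) with an O(n^2) step.
    return (eigenvectors * eigenvalues) @ eigenvectors.T

M = np.random.rand(50, 50)
M = (M + M.T) / 2                               # symmetrize
P = project_to_psd(M)
assert np.all(np.linalg.eigvalsh(P) >= -1e-10)  # PSD up to round-off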

7. Early Ensemble Attempt

Configuration: Gemini Flash Ensemble, 100 iterations

  • Config: Multiple Gemini variants
  • Result: 0.98x (slight degradation)
  • Duration: 1.0 hour

Learning: Simple ensembles don't automatically improve performance

  • Need careful model selection
  • Weight tuning important
  • Led to later island-assignment insights
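
The island idea this pointed toward, in sketch form: rather than blending outputs from several models, give each island of the evolutionary population its own dedicated model (hypothetical structure, shown only to illustrate the direction, not the later implementation):

# Hypothetical sketch: one model per island instead of a blended ensemble.
ISLAND_MODELS = {
    0: "qwen3-coder-480b",        # strong model explores diff-based edits
    1: "gemini-2.5-flash-lite",   # cheap model handles full rewrites
}

def model_for(island_id: int) -> str:
    """Route each island's mutation requests to its assigned model."""
    return ISLAND_MODELS[island_id % len(ISLAND_MODELS)]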

Pattern Discovery Summary

Model Capability Hierarchy

From these baselines, we established:

  1. Coding-Specialized > General Purpose

    • Qwen3-Coder (1.414x) >> Qwen3-235B (0.84x)
    • Both are large MoE models with comparable active parameters (22B vs. 35B); specialized training made the difference
  2. Evolution Strategy Depends on Model

    • Strong models: Diff-based wins
      • Qwen3-Coder: 1.093x → 1.414x (+30%)
    • Weak models: Full rewrite better
      • Flash Lite: 1.10x → 0.793x (-28%)
  3. Default Parameters Suboptimal

    • Temperature 0.8 too high
    • 4000 tokens insufficient
    • Led to systematic parameter study
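
The percentage deltas quoted above are plain relative changes between aggregate scores:

def rel_change(old, new):
    """Relative change between two aggregate speedup scores, in percent."""
    return (new / old - 1) * 100

print(rel_change(1.093, 1.414))  # +29.4%, the "+30%" quoted above
print(rel_change(1.10, 0.793))   # -27.9%, the "-28%" quoted above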

Task-Specific Insights

Easy Wins (>20x speedup achieved):

  • psd_cone_projection
  • toeplitz_solver
  • matrix operations

Challenging Tasks:

  • sha256_hashing (hardware-bound)
  • Simple array operations (already optimized)
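
The sha256_hashing ceiling is easy to see: Python's hashlib already dispatches to an optimized native (OpenSSL) implementation, so the large early gain presumably came from reaching this call, after which there is almost nothing left to win:

import hashlib

def digest(data: bytes) -> bytes:
    # Already delegates to native code; evolution can discover this call
    # quickly but cannot meaningfully improve on it from Python.
    return hashlib.sha256(data).digest()

print(digest(b"hello").hex())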

Evolution Behavior Patterns

Successful Evolution Characteristics:

  • Gradual refinement over iterations
  • Algorithmic changes (not just syntax)
  • Maintained correctness while improving

Failed Evolution Characteristics:

  • Syntax errors accumulate
  • Lost track of original functionality
  • Added complexity without benefit

Impact on Subsequent Experiments

These baseline experiments directly led to:

  1. Temperature optimization study (Phase 2)

    • Tested 0.2 and 0.4 against the 0.8 default after the baselines showed 0.8 was too high
  2. Token limit investigation (Phase 3)

    • Increased from 4k to 16k/32k
  3. Model selection strategy (Phase 4)

    • Focus on coding-specialized models
    • Avoid general-purpose models for code tasks
  4. Evolution strategy guidelines (Phase 5)

    • Match strategy to model capability
    • Test both approaches for new models

Key Observations from Baseline Experiments

  1. Setup Validation: Single-iteration tests successfully validated configurations
  2. Model Specialization: Coding-specialized models outperformed larger general models
  3. Evolution Strategy Impact: Wrong strategy choice degraded performance below baseline
  4. Baseline Importance: Initial measurements proved essential for comparison
  5. Time Investment: 100 iterations required hours but showed significant improvements

Experimental Patterns Observed

  1. New Model Testing:

    • The two evolution strategies produced distinct performance profiles for each model
    • Temperatures in the 0.4-0.6 range became the common choice
    • 100 iterations provided stable performance measurements
  2. Performance Characteristics:

    • Specialized coding models achieved higher speedups
    • Diff-based evolution worked better for capable models
    • Extended iterations continued to improve results