Overview
Our initial experiments established baselines for different models and discovered key patterns that guided all subsequent optimization efforts.
Key Discoveries
- First experiments with Gemini Flash Lite and Qwen models established baselines
- Discovered diff-based evolution advantages for strong coding models
Baseline Experiments
1. Gemini Flash 2.5 Lite - First Test
Configuration: Gemini Flash 2.5 Lite, 1 iteration, full rewrite
- Purpose: Test single iteration baseline
- Result: 0.621x (worse than baseline!)
- Duration: 0.1 hours
- Learning: Single iteration insufficient for any meaningful evolution
2. Gemini Flash 2.5 Lite - Full Evolution
Configuration: Gemini Flash 2.5 Lite, 100 iterations, full rewrite
- Config:
  - Full rewrite mode
  - Temperature: 0.8 (default)
  - Max tokens: 4000
  - 100 iterations
- Result: 1.10x speedup
- Best task: sha256_hashing (32.6x)
- Duration: 0.9 hours
Key Observations:
- Model achieved modest improvements
- High temperature (0.8) led to inconsistent results
- Established baseline for Flash Lite performance
3. Gemini Flash 2.5 Lite - Diff Evolution
Configuration: Gemini Flash 2.5 Lite, 100 iterations, diff-based
- Config: Same as above but diff-based evolution
- Result: 0.793x (WORSE than full rewrite!)
- Duration: 0.9 hours
Critical Discovery: Smaller models struggle with diff-based evolution
Example Failed Diff:
```diff
# Attempted optimization that failed
- visited = [False] * n
+ visited = [False] * n  # No actual change
+ # TODO: Optimize this section
```
The model added comments instead of actual optimizations!
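For contrast, a genuine optimizing diff at that site would change the computation itself. The sketch below is hypothetical (the surrounding BFS code is our invention, purely to illustrate the kind of edit the model failed to make): swapping the Python list for a NumPy boolean array so the final reachability count vectorizes.

```python
import numpy as np
from collections import deque

def bfs_reachable(adj, start):
    """Count nodes reachable from `start` in an adjacency-list graph.

    A real optimization of `visited = [False] * n`: a NumPy boolean
    array makes the final count a single vectorized sum instead of a
    Python-level loop over flags.
    """
    n = len(adj)
    visited = np.zeros(n, dtype=bool)  # an actual change: list -> ndarray
    visited[start] = True
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if not visited[v]:
                visited[v] = True
                queue.append(v)
    return int(visited.sum())  # vectorized count

# Small example graph: 0-1-2 connected, node 3 isolated
adj = [[1], [0, 2], [1], []]
assert bfs_reachable(adj, 0) == 3
assert bfs_reachable(adj, 3) == 1
```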
4. Qwen3-235B General Model
Configuration: Qwen3-235B-A22B, 100 iterations, full rewrite
- Config:
  - Full rewrite mode
  - 100 iterations
  - General-purpose 235B model
- Result: 0.84x (worse than baseline)
- Duration: 2.8 hours
- Best task: toeplitz_solver (44.6x)
Surprising Finding: Despite its 235B parameters, the model performed poorly overall
- General-purpose training not suitable for code optimization
- Many invalid programs generated
- Inconsistent performance across tasks
5. Qwen3-Coder - Full Rewrite
Configuration: Qwen3-Coder-480B, 100 iterations, full rewrite
- Config:
  - Full rewrite mode
  - Temperature: 0.6
  - 480B MoE model (35B active)
- Result: 1.093x
- Duration: 6.8 hours
- Best task: matrix_multiplication (31.9x)
Observations:
- Coding-specialized model showed promise
- Slower due to model size
- Better than the general Qwen3-235B despite a comparable active-parameter count (35B vs. 22B)
6. Qwen3-Coder - Diff Evolution
Configuration: Qwen3-Coder-480B, 100 iterations, diff-based
- Config: Diff-based evolution
- Result: 1.414x (30% improvement!)
- Duration: 4.9 hours
- Best task: psd_cone_projection (29.8x)
Breakthrough Discovery: Strong coding models excel with diffs
Successful Diff Example:
```diff
- eigenvalues[eigenvalues < 0] = 0
- A_psd = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
+ # Vectorized operation for efficiency
+ eigenvalues = np.maximum(eigenvalues, 0)
+ A_psd = (eigenvectors * eigenvalues) @ eigenvectors.T
```
Real optimization with understanding of numpy broadcasting!
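This transformation can be checked end to end. A minimal sketch (the function name `project_psd` and the surrounding eigendecomposition code are our reconstruction, not the evolved program itself):

```python
import numpy as np

def project_psd(A):
    """Project a symmetric matrix onto the PSD cone by zeroing
    negative eigenvalues, using the vectorized form from the diff."""
    eigenvalues, eigenvectors = np.linalg.eigh(A)
    eigenvalues = np.maximum(eigenvalues, 0)
    # Broadcasting scales column j of `eigenvectors` by `eigenvalues[j]`,
    # matching eigenvectors @ np.diag(eigenvalues) without materializing
    # the diagonal matrix or the extra matmul.
    return (eigenvectors * eigenvalues) @ eigenvectors.T

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
A = (A + A.T) / 2  # symmetrize

A_psd = project_psd(A)
w, V = np.linalg.eigh(A)
reference = V @ np.diag(np.maximum(w, 0)) @ V.T  # the pre-diff formulation
assert np.allclose(A_psd, reference)
assert np.min(np.linalg.eigvalsh(A_psd)) >= -1e-10
```

The broadcasted form skips building an n×n diagonal matrix and one full matrix multiply, which is likely where much of the measured speedup comes from.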
7. Early Ensemble Attempt
Configuration: Gemini Flash Ensemble, 100 iterations
- Config: Multiple Gemini variants
- Result: 0.98x (slight degradation)
- Duration: 1.0 hour
Learning: Simple ensembles don't automatically improve performance
- Need careful model selection
- Weight tuning important
- Led to later island-assignment insights
Pattern Discovery Summary
Model Capability Hierarchy
From these baselines, we established:
- Coding-Specialized > General Purpose
  - Qwen3-Coder (1.414x) >> Qwen3-235B (0.84x)
  - A general-purpose model underperformed despite its 235B parameters
- Evolution Strategy Depends on Model
  - Strong models: diff-based wins
    - Qwen3-Coder: 1.093x → 1.414x (+30%)
  - Weak models: full rewrite is better
    - Flash Lite: 1.10x → 0.793x (-28%)
- Default Parameters Suboptimal
  - Temperature 0.8 too high
  - 4000 tokens insufficient
  - Led to a systematic parameter study
Task-Specific Insights
Easy Wins (>20x speedup achieved):
- psd_cone_projection
- toeplitz_solver
- matrix operations
Challenging Tasks:
- sha256_hashing (hardware-bound)
- Simple array operations (already optimized)
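For context on the toeplitz_solver win, the classic optimization on such tasks is replacing a generic dense solve with SciPy's Levinson-recursion solver. This is an illustrative sketch of that pattern, not the evolved program itself:

```python
import numpy as np
from scipy.linalg import solve_toeplitz, toeplitz

# A small symmetric, diagonally dominant Toeplitz system
c = np.array([4.0, 1.0, 0.5, 0.25])  # first column (and row, symmetric case)
b = np.array([1.0, 2.0, 3.0, 4.0])

x_generic = np.linalg.solve(toeplitz(c), b)  # O(n^3) dense baseline
x_fast = solve_toeplitz(c, b)                # O(n^2) Levinson recursion

assert np.allclose(x_generic, x_fast)
```

Exploiting the Toeplitz structure drops a factor of n in complexity, consistent with speedups in the tens for structured-solve tasks.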
Evolution Behavior Patterns
Successful Evolution Characteristics:
- Gradual refinement over iterations
- Algorithmic changes (not just syntax)
- Maintained correctness while improving
Failed Evolution Characteristics:
- Syntax errors accumulate
- Lost track of original functionality
- Added complexity without benefit
Impact on Subsequent Experiments
These baseline experiments directly led to:
- Temperature optimization study (Phase 2)
  - Tested 0.2, 0.4, and 0.8 after observing that the default 0.8 was too high
- Token limit investigation (Phase 3)
  - Increased from 4k to 16k/32k
- Model selection strategy (Phase 4)
  - Focus on coding-specialized models
  - Avoid general-purpose models for code tasks
- Evolution strategy guidelines (Phase 5)
  - Match strategy to model capability
  - Test both approaches for new models
Key Observations from Baseline Experiments
- Setup Validation: Single iteration tests successfully validated configurations
- Model Specialization: Coding-specialized models outperformed larger general models
- Evolution Strategy Impact: Wrong strategy choice degraded performance below baseline
- Baseline Importance: Initial measurements proved essential for comparison
- Time Investment: 100 iterations required hours but showed significant improvements
Experimental Patterns Observed
- New Model Testing:
  - The two evolution strategies showed distinct performance profiles
  - Temperatures between 0.4 and 0.6 were commonly used
  - 100 iterations provided stable performance measurements
- Performance Characteristics:
  - Specialized coding models achieved higher speedups
  - Diff-based evolution worked better for capable models
  - Extended iterations continued to improve results