Baseline Experiments Analysis

Overview

Our initial experiments established baselines for different models and discovered key patterns that guided all subsequent optimization efforts.

Key Discoveries

  • First experiments with Gemini Flash Lite and Qwen models established baselines
  • Discovered diff-based evolution advantages for strong coding models

Baseline Experiments

1. Gemini Flash 2.5 Lite - First Test

Configuration: Gemini Flash 2.5 Lite, 1 iteration, full rewrite

  • Purpose: Test single-iteration baseline
  • Result: 0.621x (worse than baseline!)
  • Duration: 0.1 hours
  • Learning: A single iteration is insufficient for any meaningful evolution
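
Note on the metric: results like 0.621x aggregate per-task speedups into a single score, where values below 1.0x are slower than the reference. A minimal sketch, assuming a geometric mean (the exact aggregation used by the harness is an assumption here):

import math

def aggregate_speedup(per_task_speedups):
    """Geometric mean of per-task speedups (assumed aggregation)."""
    logs = [math.log(s) for s in per_task_speedups]
    return math.exp(sum(logs) / len(logs))

# One large win does not offset broad regressions:
print(aggregate_speedup([32.6] + [0.5] * 7))  # ~0.84x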

2. Gemini Flash 2.5 Lite - Full Evolution

Configuration: Gemini Flash 2.5 Lite, 100 iterations, full rewrite

  • Config:
    • Full rewrite mode
    • Temperature: 0.8 (default)
    • Tokens: 4000
    • 100 iterations
  • Result: 1.10x speedup
  • Best task: sha256_hashing (32.6x)
  • Duration: 0.9 hours
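
For concreteness, the settings above map onto a run spec along these lines (field names are illustrative, not the harness's actual schema):

# Illustrative run configuration (hypothetical field names)
config = {
    "model": "gemini-2.5-flash-lite",
    "evolution_mode": "full_rewrite",  # regenerate the whole program each step
    "temperature": 0.8,                # default; later found too high
    "max_tokens": 4000,
    "iterations": 100,
}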

Key Observations:

  • Model achieved modest improvements
  • High temperature (0.8) led to inconsistent results
  • Established baseline for Flash Lite performance

3. Gemini Flash 2.5 Lite - Diff Evolution

Configuration: Gemini Flash 2.5 Lite, 100 iterations, diff-based

  • Config: Same as above but diff-based evolution
  • Result: 0.793x (WORSE than full rewrite!)
  • Duration: 0.9 hours

Critical Discovery: Smaller models struggle with diff-based evolution

Example Failed Diff:

# Attempted optimization that failed
- visited = [False] * n
+ visited = [False] * n  # No actual change
+ # TODO: Optimize this section

The model added comments instead of actual optimizations!
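
To make the failure mode concrete: in diff-based evolution the model emits small search/replace hunks rather than a whole program, and the harness splices them into the current code. A minimal sketch of that splicing step (our simplification, not the harness's actual parser):

def apply_diff(source: str, search: str, replace: str) -> str:
    """Splice one edit hunk into the current program.

    A degenerate hunk like the one above passes this step trivially:
    the program still runs, but nothing was optimized.
    """
    if search not in source:
        raise ValueError("diff does not match current program")
    return source.replace(search, replace, 1)

program = "visited = [False] * n"
program = apply_diff(
    program,
    search="visited = [False] * n",
    replace="visited = [False] * n  # No actual change",
)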

4. Qwen3-235B General Model

Configuration: Qwen3-235B-A22B, 100 iterations, full rewrite

  • Config:
    • Full rewrite mode
    • 100 iterations
    • General-purpose 235B model
  • Result: 0.84x (worse than baseline)
  • Duration: 2.8 hours
  • Best task: toeplitz_solver (44.6x)

Surprising Finding: Despite its 235B parameters, the model performed poorly overall

  • General-purpose training not suitable for code optimization
  • Many invalid programs generated
  • Inconsistent performance across tasks
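
The toeplitz_solver outlier above shows what a genuine algorithmic change looks like: swapping a generic O(n^3) dense solve for the O(n^2) Levinson-Durbin routine SciPy already ships. An illustrative before/after (our reconstruction, not the evolved program itself):

import numpy as np
from scipy.linalg import solve_toeplitz, toeplitz

rng = np.random.default_rng(0)
n = 1000
c = rng.random(n)
c[0] += n                # diagonal dominance keeps the system well-conditioned
b = rng.random(n)

# Baseline: materialize the full matrix and run a generic O(n^3) solve.
x_slow = np.linalg.solve(toeplitz(c), b)

# Optimized: exploit Toeplitz structure via Levinson recursion,
# O(n^2) time, without ever forming the n x n matrix.
x_fast = solve_toeplitz(c, b)

assert np.allclose(x_slow, x_fast)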

5. Qwen3-Coder - Full Rewrite

Configuration: Qwen3-Coder-480B, 100 iterations, full rewrite

  • Config:
    • Full rewrite mode
    • Temperature: 0.6
    • 480B MoE model (35B active)
  • Result: 1.093x
  • Duration: 6.8 hours
  • Best task: matrix_multiplication (31.9x)

Observations:

  • Coding-specialized model showed promise
  • Slower due to model size
  • Better than the general-purpose Qwen3-235B despite a similar active-parameter count (35B vs. 22B)
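
The matrix_multiplication win above follows a familiar pattern: most of the headroom comes from replacing interpreted Python loops with a single BLAS-backed call. A representative before/after (illustrative, not the actual evolved code):

import numpy as np

def matmul_loops(A, B):
    """Naive triple loop: interpreter overhead on every scalar multiply."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

def matmul_blas(A, B):
    """One call into optimized, multithreaded BLAS."""
    return A @ B

A = np.random.rand(64, 64)
B = np.random.rand(64, 64)
assert np.allclose(matmul_loops(A, B), matmul_blas(A, B))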

6. Qwen3-Coder - Diff Evolution

Configuration: Qwen3-Coder-480B, 100 iterations, diff-based

  • Config: Diff-based evolution
  • Result: 1.414x (30% improvement!)
  • Duration: 4.9 hours
  • Best task: psd_cone_projection (29.8x)

Breakthrough Discovery: Strong coding models excel with diffs

Successful Diff Example:

- eigenvalues[eigenvalues < 0] = 0
- A_psd = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
+ # Vectorized operation for efficiency
+ eigenvalues = np.maximum(eigenvalues, 0)
+ A_psd = (eigenvectors * eigenvalues) @ eigenvectors.T

Real optimization with understanding of numpy broadcasting!
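
Put together, the evolved hunk amounts to a projection routine like the one below. The broadcast eigenvectors * eigenvalues scales column i by eigenvalue i in O(n^2), eliminating the extra O(n^3) multiply against the dense matrix that np.diag would build (a minimal reconstruction, assuming a symmetric input):

import numpy as np

def project_to_psd(A):
    """Project a symmetric matrix onto the PSD cone."""
    eigenvalues, eigenvectors = np.linalg.eigh(A)
    eigenvalues = np.maximum(eigenvalues, 0)   # clamp negative eigenvalues
    # Broadcasting scales each eigenvector column by its eigenvalue,
    # replacing eigenvectors @ np.diag(eigenvalues) with an O(n^2) step.
    return (eigenvectors * eigenvalues) @ eigenvectors.T

M = np.random.rand(50, 50)
M = (M + M.T) / 2                               # symmetrize
P = project_to_psd(M)
assert np.all(np.linalg.eigvalsh(P) >= -1e-10)  # PSD up to round-off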

7. Early Ensemble Attempt

Configuration: Gemini Flash Ensemble, 100 iterations

  • Config: Multiple Gemini variants
  • Result: 0.98x (slight degradation)
  • Duration: 1.0 hour

Learning: Simple ensembles don't automatically improve performance

  • Need careful model selection
  • Weight tuning important
  • Led to later island-assignment insights
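
The island idea this pointed toward, in sketch form: rather than blending outputs from several models, give each island of the evolutionary population its own dedicated model (hypothetical structure, shown only to illustrate the direction, not the later implementation):

# Hypothetical sketch: one model per island instead of a blended ensemble.
ISLAND_MODELS = {
    0: "qwen3-coder-480b",        # strong model explores diff-based edits
    1: "gemini-2.5-flash-lite",   # cheap model handles full rewrites
}

def model_for(island_id: int) -> str:
    """Route each island's mutation requests to its assigned model."""
    return ISLAND_MODELS[island_id % len(ISLAND_MODELS)]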

Pattern Discovery Summary

Model Capability Hierarchy

From these baselines, we established:

  1. Coding-Specialized > General Purpose

    • Qwen3-Coder (1.414x) >> Qwen3-235B (0.84x)
    • Both are large MoE models with comparable active parameters (22B vs. 35B); specialized training made the difference
  2. Evolution Strategy Depends on Model

    • Strong models: Diff-based wins
      • Qwen3-Coder: 1.093x → 1.414x (+30%)
    • Weak models: Full rewrite better
      • Flash Lite: 1.10x → 0.793x (-28%)
  3. Default Parameters Suboptimal

    • Temperature 0.8 too high
    • 4000 tokens insufficient
    • Led to systematic parameter study
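
The percentage deltas quoted above are plain relative changes between aggregate scores:

def rel_change(old, new):
    """Relative change between two aggregate speedup scores, in percent."""
    return (new / old - 1) * 100

print(rel_change(1.093, 1.414))  # +29.4%, the "+30%" quoted above
print(rel_change(1.10, 0.793))   # -27.9%, the "-28%" quoted above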

Task-Specific Insights

Easy Wins (>20x speedup achieved):

  • psd_cone_projection
  • toeplitz_solver
  • matrix operations

Challenging Tasks:

  • sha256_hashing (hardware-bound)
  • Simple array operations (already optimized)
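
The sha256_hashing ceiling is easy to see: Python's hashlib already dispatches to an optimized native (OpenSSL) implementation, so the large early gain presumably came from reaching this call, after which there is almost nothing left to win:

import hashlib

def digest(data: bytes) -> bytes:
    # Already delegates to native code; evolution can discover this call
    # quickly but cannot meaningfully improve on it from Python.
    return hashlib.sha256(data).digest()

print(digest(b"hello").hex())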

Evolution Behavior Patterns

Successful Evolution Characteristics:

  • Gradual refinement over iterations
  • Algorithmic changes (not just syntax)
  • Maintained correctness while improving

Failed Evolution Characteristics:

  • Syntax errors accumulate
  • Lost track of original functionality
  • Added complexity without benefit

Impact on Subsequent Experiments

These baseline experiments directly led to:

  1. Temperature optimization study (Phase 2)

    • Tested 0.2 and 0.4 against the 0.8 default after the baselines showed 0.8 was too high
  2. Token limit investigation (Phase 3)

    • Increased from 4k to 16k/32k
  3. Model selection strategy (Phase 4)

    • Focus on coding-specialized models
    • Avoid general-purpose models for code tasks
  4. Evolution strategy guidelines (Phase 5)

    • Match strategy to model capability
    • Test both approaches for new models

Key Observations from Baseline Experiments

  1. Setup Validation: Single-iteration tests successfully validated configurations
  2. Model Specialization: Coding-specialized models outperformed larger general models
  3. Evolution Strategy Impact: Wrong strategy choice degraded performance below baseline
  4. Baseline Importance: Initial measurements proved essential for comparison
  5. Time Investment: 100 iterations required hours but showed significant improvements

Experimental Patterns Observed

  1. New Model Testing:

    • The two evolution strategies produced distinct performance profiles for each model
    • Temperatures in the 0.4-0.6 range became the common choice
    • 100 iterations provided stable performance measurements
  2. Performance Characteristics:

    • Specialized coding models achieved higher speedups
    • Diff-based evolution worked better for capable models
    • Extended iterations continued to improve results