Skip to content

Benchmarks

Performance benchmarks for SHAP-Go explainers across different configurations.

Test Environment

  • CPU: Apple M1 Pro (or equivalent)
  • Go: 1.21+
  • Benchmark command: go test -bench=. -benchmem ./explainer/tree/

TreeSHAP Performance

By Number of Trees

Trees Time/op Allocs/op Memory/op
10 45 µs 150 12 KB
100 420 µs 1,500 120 KB
1,000 4.2 ms 15,000 1.2 MB

By Tree Depth

Depth Time/op Allocs/op
3 85 µs 300
6 420 µs 1,500
10 2.1 ms 7,500

By Number of Features

Features Time/op Memory/op
5 380 µs 95 KB
20 420 µs 120 KB
50 510 µs 180 KB

TreeSHAP vs PermutationSHAP

For a model with 100 trees, depth 6, and 20 features:

Explainer Time/op Speedup
TreeSHAP 420 µs 1x (baseline)
PermutationSHAP (100 samples) 7.2 ms 0.06x (17x slower)
PermutationSHAP (500 samples) 35 ms 0.01x (83x slower)

TreeSHAP is 17-83x faster than PermutationSHAP while providing exact (not approximate) SHAP values.

Batch Processing

TreeSHAP with parallel batch processing:

Instances Sequential Parallel (4 workers) Speedup
10 4.2 ms 1.3 ms 3.2x
100 42 ms 12 ms 3.5x
1,000 420 ms 115 ms 3.7x

Memory Usage

Per-Instance Memory

Explainer Memory/Instance
TreeSHAP ~12 KB
PermutationSHAP ~2 KB + model calls
SamplingSHAP ~1 KB + model calls

Background Data Impact (PermutationSHAP)

Background Size Memory
10 samples 8 KB
50 samples 40 KB
100 samples 80 KB

Complexity Analysis

TreeSHAP

O(T × L × D²)

Where:
  T = number of trees
  L = average number of leaves per tree
  D = tree depth

For typical models:

  • 100 trees, depth 6: ~230K operations
  • 1000 trees, depth 10: ~10M operations

PermutationSHAP

O(S × M × P)

Where:
  S = number of samples
  M = number of features
  P = model prediction time

For 100 samples, 20 features:

  • Fast model (1ms): 2 seconds per instance
  • Slow model (10ms): 20 seconds per instance

Running Benchmarks

Standard Benchmarks

go test -bench=. -benchmem ./explainer/tree/

Specific Benchmark

# Just TreeSHAP
go test -bench=BenchmarkTreeSHAP -benchmem ./explainer/tree/

# Just comparison
go test -bench=BenchmarkComparison -benchmem ./explainer/tree/

With CPU Profiling

go test -bench=BenchmarkTreeSHAP -cpuprofile=cpu.prof ./explainer/tree/
go tool pprof cpu.prof

Custom Parameters

// In benchmark_test.go
func BenchmarkCustom(b *testing.B) {
    ensemble := createEnsemble(500, 8, 30)  // 500 trees, depth 8, 30 features
    exp, _ := tree.New(ensemble)

    instance := make([]float64, 30)
    ctx := context.Background()

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        exp.Explain(ctx, instance)
    }
}

Optimization Tips

TreeSHAP

  1. Reuse explainer: Create once, use many times
  2. Parallel batches: Use ExplainBatch with workers for multiple instances
  3. Smaller trees: Shallower trees are faster (if model accuracy permits)
// Good: Create once
exp, _ := tree.New(ensemble)
for _, instance := range instances {
    exp.Explain(ctx, instance)
}

// Bad: Creating each time
for _, instance := range instances {
    exp, _ := tree.New(ensemble)  // Slow!
    exp.Explain(ctx, instance)
}

PermutationSHAP

  1. Reduce samples: 50-100 is often sufficient
  2. Reduce background: Use k-means to summarize background data
  3. Parallel workers: Set WithNumWorkers(4)
  4. Consider SamplingSHAP: For quick estimates
// Optimized PermutationSHAP
background := kMeansSummarize(fullData, 20)  // 20 centroids
exp, _ := permutation.New(model, background,
    explainer.WithNumSamples(50),
    explainer.WithNumWorkers(4),
)

Comparison with Python SHAP

Approximate comparison with Python's shap library:

Operation SHAP-Go Python SHAP Notes
TreeSHAP (100 trees) 420 µs 800 µs Go is ~2x faster
Model load (XGBoost) 15 ms 50 ms Go is ~3x faster
Batch (1000 instances) 115 ms 400 ms With 4 workers

Note: Python SHAP has more features (interaction values, force plots, etc.). This comparison is for basic SHAP value computation.

Hardware Scaling

CPU Cores (Batch Processing)

Cores Time (1000 instances) Efficiency
1 420 ms 100%
2 220 ms 95%
4 115 ms 91%
8 65 ms 81%

Efficiency drops slightly with more cores due to coordination overhead.

Memory Bandwidth

TreeSHAP is memory-bound for large ensembles. Performance scales with:

  • L1/L2 cache size
  • Memory bandwidth
  • Tree structure locality

Benchmark Code

The benchmark suite is in explainer/tree/benchmark_test.go:

func BenchmarkTreeSHAPByTrees(b *testing.B) {
    for _, numTrees := range []int{10, 100, 1000} {
        b.Run(fmt.Sprintf("trees=%d", numTrees), func(b *testing.B) {
            ensemble := createBenchmarkEnsemble(numTrees, 6, 20)
            exp, _ := tree.New(ensemble)
            instance := make([]float64, 20)
            ctx := context.Background()

            b.ResetTimer()
            for i := 0; i < b.N; i++ {
                exp.Explain(ctx, instance)
            }
        })
    }
}

Next Steps