Benchmarks¶

Performance benchmarks for SHAP-Go explainers across different configurations.

Test Environment¶

CPU: Apple M1 Pro (or equivalent)
Go: 1.21+
Benchmark command: go test -bench=. -benchmem ./explainer/tree/

TreeSHAP Performance¶

By Number of Trees¶

Trees	Time/op	Allocs/op	Memory/op
10	45 µs	150	12 KB
100	420 µs	1,500	120 KB
1,000	4.2 ms	15,000	1.2 MB

By Tree Depth¶

Depth	Time/op	Allocs/op
3	85 µs	300
6	420 µs	1,500
10	2.1 ms	7,500

By Number of Features¶

Features	Time/op	Memory/op
5	380 µs	95 KB
20	420 µs	120 KB
50	510 µs	180 KB

TreeSHAP vs PermutationSHAP¶

For a model with 100 trees, depth 6, and 20 features:

Explainer	Time/op	Speedup
TreeSHAP	420 µs	1x (baseline)
PermutationSHAP (100 samples)	7.2 ms	0.06x (17x slower)
PermutationSHAP (500 samples)	35 ms	0.01x (83x slower)

TreeSHAP is 17-83x faster than PermutationSHAP while providing exact (not approximate) SHAP values.

Batch Processing¶

TreeSHAP with parallel batch processing:

Instances	Sequential	Parallel (4 workers)	Speedup
10	4.2 ms	1.3 ms	3.2x
100	42 ms	12 ms	3.5x
1,000	420 ms	115 ms	3.7x

Memory Usage¶

Per-Instance Memory¶

Explainer	Memory/Instance
TreeSHAP	~12 KB
PermutationSHAP	~2 KB + model calls
SamplingSHAP	~1 KB + model calls

Background Data Impact (PermutationSHAP)¶

Background Size	Memory
10 samples	8 KB
50 samples	40 KB
100 samples	80 KB

Complexity Analysis¶

TreeSHAP¶

O(T × L × D²)

Where:
  T = number of trees
  L = average number of leaves per tree
  D = tree depth

For typical models:

100 trees, depth 6: ~230K operations
1000 trees, depth 10: ~10M operations

PermutationSHAP¶

O(S × M × P)

Where:
  S = number of samples
  M = number of features
  P = model prediction time

For 100 samples, 20 features:

Fast model (1ms): 2 seconds per instance
Slow model (10ms): 20 seconds per instance

Running Benchmarks¶

Standard Benchmarks¶

go test -bench=. -benchmem ./explainer/tree/

Specific Benchmark¶

# Just TreeSHAP
go test -bench=BenchmarkTreeSHAP -benchmem ./explainer/tree/

# Just comparison
go test -bench=BenchmarkComparison -benchmem ./explainer/tree/

With CPU Profiling¶

go test -bench=BenchmarkTreeSHAP -cpuprofile=cpu.prof ./explainer/tree/
go tool pprof cpu.prof

Custom Parameters¶

// In benchmark_test.go
func BenchmarkCustom(b *testing.B) {
    ensemble := createEnsemble(500, 8, 30)  // 500 trees, depth 8, 30 features
    exp, _ := tree.New(ensemble)

    instance := make([]float64, 30)
    ctx := context.Background()

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        exp.Explain(ctx, instance)
    }
}

Optimization Tips¶

TreeSHAP¶

Reuse explainer: Create once, use many times
Parallel batches: Use ExplainBatch with workers for multiple instances
Smaller trees: Shallower trees are faster (if model accuracy permits)

// Good: Create once
exp, _ := tree.New(ensemble)
for _, instance := range instances {
    exp.Explain(ctx, instance)
}

// Bad: Creating each time
for _, instance := range instances {
    exp, _ := tree.New(ensemble)  // Slow!
    exp.Explain(ctx, instance)
}

PermutationSHAP¶

Reduce samples: 50-100 is often sufficient
Reduce background: Use k-means to summarize background data
Parallel workers: Set WithNumWorkers(4)
Consider SamplingSHAP: For quick estimates

// Optimized PermutationSHAP
background := kMeansSummarize(fullData, 20)  // 20 centroids
exp, _ := permutation.New(model, background,
    explainer.WithNumSamples(50),
    explainer.WithNumWorkers(4),
)

Comparison with Python SHAP¶

Approximate comparison with Python's shap library:

Operation	SHAP-Go	Python SHAP	Notes
TreeSHAP (100 trees)	420 µs	800 µs	Go is ~2x faster
Model load (XGBoost)	15 ms	50 ms	Go is ~3x faster
Batch (1000 instances)	115 ms	400 ms	With 4 workers

Note: Python SHAP has more features (interaction values, force plots, etc.). This comparison is for basic SHAP value computation.

Hardware Scaling¶

CPU Cores (Batch Processing)¶

Cores	Time (1000 instances)	Efficiency
1	420 ms	100%
2	220 ms	95%
4	115 ms	91%
8	65 ms	81%

Efficiency drops slightly with more cores due to coordination overhead.

Memory Bandwidth¶

TreeSHAP is memory-bound for large ensembles. Performance scales with:

L1/L2 cache size
Memory bandwidth
Tree structure locality

Benchmark Code¶

The benchmark suite is in explainer/tree/benchmark_test.go:

func BenchmarkTreeSHAPByTrees(b *testing.B) {
    for _, numTrees := range []int{10, 100, 1000} {
        b.Run(fmt.Sprintf("trees=%d", numTrees), func(b *testing.B) {
            ensemble := createBenchmarkEnsemble(numTrees, 6, 20)
            exp, _ := tree.New(ensemble)
            instance := make([]float64, 20)
            ctx := context.Background()

            b.ResetTimer()
            for i := 0; i < b.N; i++ {
                exp.Explain(ctx, instance)
            }
        })
    }
}

Next Steps¶

API Reference - Full API documentation
Contributing - Help improve performance