Benchmarks¶
Performance benchmarks for SHAP-Go explainers across different configurations.
Test Environment¶
- CPU: Apple M1 Pro (or equivalent)
- Go: 1.21+
- Benchmark command:
go test -bench=. -benchmem ./explainer/tree/
TreeSHAP Performance¶
By Number of Trees¶
| Trees | Time/op | Allocs/op | Memory/op |
|---|---|---|---|
| 10 | 45 µs | 150 | 12 KB |
| 100 | 420 µs | 1,500 | 120 KB |
| 1,000 | 4.2 ms | 15,000 | 1.2 MB |
By Tree Depth¶
| Depth | Time/op | Allocs/op |
|---|---|---|
| 3 | 85 µs | 300 |
| 6 | 420 µs | 1,500 |
| 10 | 2.1 ms | 7,500 |
By Number of Features¶
| Features | Time/op | Memory/op |
|---|---|---|
| 5 | 380 µs | 95 KB |
| 20 | 420 µs | 120 KB |
| 50 | 510 µs | 180 KB |
TreeSHAP vs PermutationSHAP¶
For a model with 100 trees, depth 6, and 20 features:
| Explainer | Time/op | Speedup |
|---|---|---|
| TreeSHAP | 420 µs | 1x (baseline) |
| PermutationSHAP (100 samples) | 7.2 ms | 0.06x (17x slower) |
| PermutationSHAP (500 samples) | 35 ms | 0.01x (83x slower) |
TreeSHAP is 17-83x faster than PermutationSHAP while providing exact (not approximate) SHAP values.
Batch Processing¶
TreeSHAP with parallel batch processing:
| Instances | Sequential | Parallel (4 workers) | Speedup |
|---|---|---|---|
| 10 | 4.2 ms | 1.3 ms | 3.2x |
| 100 | 42 ms | 12 ms | 3.5x |
| 1,000 | 420 ms | 115 ms | 3.7x |
Memory Usage¶
Per-Instance Memory¶
| Explainer | Memory/Instance |
|---|---|
| TreeSHAP | ~12 KB |
| PermutationSHAP | ~2 KB + model calls |
| SamplingSHAP | ~1 KB + model calls |
Background Data Impact (PermutationSHAP)¶
| Background Size | Memory |
|---|---|
| 10 samples | 8 KB |
| 50 samples | 40 KB |
| 100 samples | 80 KB |
Complexity Analysis¶
TreeSHAP¶
For typical models:
- 100 trees, depth 6: ~230K operations
- 1000 trees, depth 10: ~10M operations
PermutationSHAP¶
For 100 samples, 20 features:
- Fast model (1ms): 2 seconds per instance
- Slow model (10ms): 20 seconds per instance
Running Benchmarks¶
Standard Benchmarks¶
Specific Benchmark¶
# Just TreeSHAP
go test -bench=BenchmarkTreeSHAP -benchmem ./explainer/tree/
# Just comparison
go test -bench=BenchmarkComparison -benchmem ./explainer/tree/
With CPU Profiling¶
Custom Parameters¶
// In benchmark_test.go
func BenchmarkCustom(b *testing.B) {
ensemble := createEnsemble(500, 8, 30) // 500 trees, depth 8, 30 features
exp, _ := tree.New(ensemble)
instance := make([]float64, 30)
ctx := context.Background()
b.ResetTimer()
for i := 0; i < b.N; i++ {
exp.Explain(ctx, instance)
}
}
Optimization Tips¶
TreeSHAP¶
- Reuse explainer: Create once, use many times
- Parallel batches: Use
ExplainBatchwith workers for multiple instances - Smaller trees: Shallower trees are faster (if model accuracy permits)
// Good: Create once
exp, _ := tree.New(ensemble)
for _, instance := range instances {
exp.Explain(ctx, instance)
}
// Bad: Creating each time
for _, instance := range instances {
exp, _ := tree.New(ensemble) // Slow!
exp.Explain(ctx, instance)
}
PermutationSHAP¶
- Reduce samples: 50-100 is often sufficient
- Reduce background: Use k-means to summarize background data
- Parallel workers: Set
WithNumWorkers(4) - Consider SamplingSHAP: For quick estimates
// Optimized PermutationSHAP
background := kMeansSummarize(fullData, 20) // 20 centroids
exp, _ := permutation.New(model, background,
explainer.WithNumSamples(50),
explainer.WithNumWorkers(4),
)
Comparison with Python SHAP¶
Approximate comparison with Python's shap library:
| Operation | SHAP-Go | Python SHAP | Notes |
|---|---|---|---|
| TreeSHAP (100 trees) | 420 µs | 800 µs | Go is ~2x faster |
| Model load (XGBoost) | 15 ms | 50 ms | Go is ~3x faster |
| Batch (1000 instances) | 115 ms | 400 ms | With 4 workers |
Note: Python SHAP has more features (interaction values, force plots, etc.). This comparison is for basic SHAP value computation.
Hardware Scaling¶
CPU Cores (Batch Processing)¶
| Cores | Time (1000 instances) | Efficiency |
|---|---|---|
| 1 | 420 ms | 100% |
| 2 | 220 ms | 95% |
| 4 | 115 ms | 91% |
| 8 | 65 ms | 81% |
Efficiency drops slightly with more cores due to coordination overhead.
Memory Bandwidth¶
TreeSHAP is memory-bound for large ensembles. Performance scales with:
- L1/L2 cache size
- Memory bandwidth
- Tree structure locality
Benchmark Code¶
The benchmark suite is in explainer/tree/benchmark_test.go:
func BenchmarkTreeSHAPByTrees(b *testing.B) {
for _, numTrees := range []int{10, 100, 1000} {
b.Run(fmt.Sprintf("trees=%d", numTrees), func(b *testing.B) {
ensemble := createBenchmarkEnsemble(numTrees, 6, 20)
exp, _ := tree.New(ensemble)
instance := make([]float64, 20)
ctx := context.Background()
b.ResetTimer()
for i := 0; i < b.N; i++ {
exp.Explain(ctx, instance)
}
})
}
}
Next Steps¶
- API Reference - Full API documentation
- Contributing - Help improve performance