# Quick Start Guide

This guide will help you get started with Calibration Toolbox.
## Basic Usage

### Computing Calibration Metrics
The simplest way to compute calibration metrics is to use the wrapper functions:

```python
import numpy as np
from calibration_toolbox import expected_calibration_error

# Your model's predicted probabilities (n_samples, n_classes)
probabilities = np.array([
    [0.8, 0.2],
    [0.6, 0.4],
    [0.9, 0.1],
    [0.3, 0.7]
])

# True labels
labels = np.array([0, 1, 0, 1])

# Compute Expected Calibration Error
ece = expected_calibration_error(probabilities, labels)
print(f"ECE: {ece:.4f}")
```
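Under the hood, ECE partitions predictions into equal-width confidence bins and averages the per-bin gap between accuracy and mean confidence, weighted by bin size. Here is a minimal NumPy sketch of that standard definition (illustrative only; the toolbox's own implementation may differ in details such as bin-boundary handling):

```python
import numpy as np

def ece_sketch(probabilities, labels, n_bins=10):
    """Equal-width-bin ECE: bin-weighted average of |accuracy - confidence|."""
    confidences = probabilities.max(axis=1)        # top-class confidence
    predictions = probabilities.argmax(axis=1)     # predicted class
    correct = (predictions == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap             # weight by bin frequency
    return ece

probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.9, 0.1], [0.3, 0.7]])
labels = np.array([0, 1, 0, 1])
print(ece_sketch(probs, labels))  # 0.3 for this toy example
```

Note that here the overall accuracy (0.75) equals the overall mean confidence (0.75), yet ECE is nonzero: calibration is assessed bin by bin, not in aggregate.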
### Multiple Metrics

You can compute multiple calibration metrics, such as Maximum Calibration Error (MCE), Root Mean Squared Calibration Error (RMSCE), Adaptive Calibration Error (ACE), and Static Calibration Error (SCE):
```python
from calibration_toolbox import ECE, MCE, RMSCE, ACE, SCE

ece = ECE(probabilities, labels)
mce = MCE(probabilities, labels)
rmsce = RMSCE(probabilities, labels)
ace = ACE(probabilities, labels)
sce = SCE(probabilities, labels)

print(f"ECE: {ece:.4f}")
print(f"MCE: {mce:.4f}")
print(f"RMSCE: {rmsce:.4f}")
print(f"ACE: {ace:.4f}")
print(f"SCE: {sce:.4f}")
```
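These metrics differ mainly in how the per-bin |accuracy - confidence| gaps are aggregated: ECE takes a frequency-weighted mean, MCE the maximum over bins, and RMSCE a frequency-weighted root mean square. The numbers below are hypothetical per-bin gaps and weights, used only to show the aggregation step:

```python
import numpy as np

# Hypothetical per-bin calibration gaps and the fraction of samples per bin
gaps = np.array([0.05, 0.10, 0.30])     # |accuracy - confidence| in each bin
weights = np.array([0.5, 0.3, 0.2])     # bin sample fractions (sum to 1)

ece = (weights * gaps).sum()                 # frequency-weighted mean
mce = gaps.max()                             # worst-case bin
rmsce = np.sqrt((weights * gaps**2).sum())   # root mean squared gap

print(f"ECE: {ece:.4f}, MCE: {mce:.4f}, RMSCE: {rmsce:.4f}")
```

ACE and SCE instead change how the bins themselves are formed (adaptive equal-mass bins and per-class binning, respectively), as the General Calibration Error section below illustrates.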
### General Calibration Error

The General Calibration Error (GCE) is a flexible framework that can compute various metrics:
```python
from calibration_toolbox import general_calibration_error

# ECE: L1 norm, not class-conditional
ece = general_calibration_error(
    probabilities, labels,
    norm=1,
    class_conditional=False
)

# MCE: L-infinity norm
mce = general_calibration_error(
    probabilities, labels,
    norm='inf',
    class_conditional=False
)

# ACE: class-conditional with adaptive bins
ace = general_calibration_error(
    probabilities, labels,
    norm=1,
    class_conditional=True,
    adaptive_bins=True,
    top_k_classes='all'
)
```
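With `adaptive_bins=True`, bin edges are chosen so that each bin receives roughly the same number of samples rather than spanning equal confidence widths, which keeps bins populated even when a model concentrates its confidence near 1. A small NumPy sketch of equal-mass edges via quantiles (an illustration of the idea, not the toolbox's internal code):

```python
import numpy as np

def adaptive_bin_edges(confidences, n_bins=5):
    """Equal-mass bin edges: quantiles of the observed confidence values."""
    return np.quantile(confidences, np.linspace(0.0, 1.0, n_bins + 1))

rng = np.random.default_rng(0)
conf = rng.beta(5, 2, size=1000)       # confidences skewed toward 1
edges = adaptive_bin_edges(conf)
counts, _ = np.histogram(conf, bins=edges)
print(edges.round(3))
print(counts)                          # about 200 samples in every bin
```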
## Visualization

### Reliability Diagram

A reliability diagram shows the relationship between predicted confidence and actual accuracy:
```python
from calibration_toolbox import reliability_diagram

reliability_diagram(probabilities, labels, n_bins=10)
```
### Confidence Histogram

A confidence histogram shows the distribution of model confidences:
```python
from calibration_toolbox import confidence_histogram

confidence_histogram(probabilities, labels, n_bins=15)
```
### Class-wise Calibration Curves

For multi-class problems, you can visualize per-class calibration:
```python
import numpy as np
from calibration_toolbox import class_wise_calibration_curve

# Multi-class probabilities
probs = np.random.dirichlet(np.ones(5), size=100)
labels = np.random.randint(0, 5, size=100)

class_wise_calibration_curve(probs, labels)
```
### Metric Comparison

Compare multiple calibration metrics in one plot:
```python
from calibration_toolbox import calibration_error_decomposition

calibration_error_decomposition(probabilities, labels)
```
## Working with Logits

If your model outputs logits instead of probabilities, set `logits=True`:
```python
import numpy as np
from calibration_toolbox import expected_calibration_error

# Model outputs (logits)
logits = np.array([
    [2.0, -1.0],
    [1.0, 0.5],
    [-0.5, 1.5]
])
labels = np.array([0, 0, 1])

# Compute ECE (will apply softmax internally)
ece = expected_calibration_error(logits, labels, logits=True)
```
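Passing `logits=True` amounts to applying a softmax before binning; if you prefer, you can do the conversion yourself. Below is the standard numerically stable softmax in NumPy (a generic sketch, not necessarily identical to the toolbox's internal conversion):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

logits = np.array([[2.0, -1.0], [1.0, 0.5], [-0.5, 1.5]])
probs = softmax(logits)
print(probs.round(3))   # rows sum to 1; first row is about [0.953, 0.047]
```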
## Common Patterns

### Evaluating a Trained Model
```python
import numpy as np
from calibration_toolbox import ECE, reliability_diagram

# Get predictions from your model:
# predictions = model.predict_proba(X_test)
# For this example, use dummy data
predictions = np.random.dirichlet(np.ones(3), size=200)
true_labels = np.random.randint(0, 3, size=200)

# Compute calibration error
ece = ECE(predictions, true_labels)
print(f"Model ECE: {ece:.4f}")

# Visualize calibration
reliability_diagram(predictions, true_labels,
                    title=f"Model Calibration (ECE: {ece:.4f})")
```
### Comparing Multiple Models

```python
import numpy as np
from calibration_toolbox import ECE

# Predictions from different models
model1_probs = np.random.dirichlet(np.ones(2), size=100)
model2_probs = np.random.dirichlet(np.ones(2), size=100)
labels = np.random.randint(0, 2, size=100)

ece1 = ECE(model1_probs, labels)
ece2 = ECE(model2_probs, labels)

print(f"Model 1 ECE: {ece1:.4f}")
print(f"Model 2 ECE: {ece2:.4f}")

if ece1 < ece2:
    print("Model 1 is better calibrated")
else:
    print("Model 2 is better calibrated")
```
## Next Steps

- Check out the API Reference for detailed documentation of all functions.
- See the Examples for more comprehensive, end-to-end examples.
- Read the References for the research papers behind each metric.