Metrics Module
This module provides a comprehensive collection of binning-based calibration metrics for evaluating the predictive uncertainty of classification models, centered around the General Calibration Error (GCE) framework.
- calibration_toolbox.metrics.general_calibration_error(probabilities, labels, n_bins=15, class_conditional=False, adaptive_bins=False, top_k_classes=1, norm=1, thresholding=0.0, logits=False)[source]
Calculate General Calibration Error (GCE).
The GCE is a flexible calibration metric that can be configured to produce many popular calibration metrics including ECE, MCE, RMSCE, ACE, and SCE.
The class-conditional GCE with L^p norm is defined as: GCE = (Σ_k Σ_b (n_bk / (N·K)) |acc(b,k) - conf(b,k)|^p)^(1/p)
Where acc(b,k) and conf(b,k) are the accuracy and confidence of bin b for class label k; n_bk is the number of predictions in bin b for class k; N is the total number of data points; and K is the number of classes.
- References:
Kull et al. (2019). “Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration.” NeurIPS.
Nixon et al. (2019). “Measuring Calibration in Deep Learning.” CVPR Workshops.
- Args:
- probabilities: Array of shape (n_samples, n_classes) containing predicted probabilities for each class.
- labels: Array of shape (n_samples,) containing true class labels.
- n_bins: Number of bins for confidence discretization. Default: 15.
- class_conditional: If True, compute class-conditional calibration. Default: False.
- adaptive_bins: If True, use adaptive binning based on the data distribution. Default: False (uniform bins).
- top_k_classes: Number of top predicted classes to consider. Use 'all' to consider all classes. Default: 1 (top prediction only).
- norm: L^p norm to use. Can be 1, 2, or 'inf'. Default: 1.
- thresholding: Ignore probabilities below this threshold. Default: 0.0.
- logits: If True, the input is logits and will be converted to probabilities. Default: False.
- Returns:
float: GCE value, typically between 0 and 1 (lower is better).
- Example:
>>> probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
>>> labels = np.array([0, 1, 0])
>>> gce = general_calibration_error(probs, labels)
>>> print(f"GCE: {gce:.4f}")
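With the settings above, GCE reduces to the named metrics: the defaults give ECE, norm='inf' gives MCE, norm=2 gives RMSCE, and class_conditional=True gives SCE. The reduction can be illustrated with a minimal NumPy sketch of the formula (uniform bins only; adaptive binning, top-k selection, and thresholding are omitted). This is an independent reimplementation of the formula, not the toolbox's source, and `gce_sketch` is a hypothetical name:

```python
import numpy as np

def gce_sketch(probabilities, labels, n_bins=15, class_conditional=False, norm=1):
    """Illustrative GCE over uniform confidence bins (not the toolbox's code)."""
    probabilities = np.asarray(probabilities, dtype=float)
    labels = np.asarray(labels)
    n, n_classes = probabilities.shape
    edges = np.linspace(0.0, 1.0, n_bins + 1)

    def binned_gaps(confs, accs, total):
        # per-bin |accuracy - confidence| gaps with n_b / total weights
        gaps, weights = [], []
        for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
            # right-inclusive bins; the first bin also includes confidence 0.0
            in_bin = (confs <= hi) & ((confs > lo) if i else (confs >= lo))
            if in_bin.any():
                gaps.append(abs(accs[in_bin].mean() - confs[in_bin].mean()))
                weights.append(in_bin.sum() / total)
        return np.array(gaps), np.array(weights)

    if class_conditional:
        # bin every class's probability column separately: weights n_bk / (N * K)
        per_class = [binned_gaps(probabilities[:, c],
                                 (labels == c).astype(float),
                                 n * n_classes)
                     for c in range(n_classes)]
        gaps = np.concatenate([g for g, _ in per_class])
        weights = np.concatenate([w for _, w in per_class])
    else:
        # top-1 confidence and correctness: weights n_b / N
        gaps, weights = binned_gaps(probabilities.max(axis=1),
                                    (probabilities.argmax(axis=1) == labels).astype(float),
                                    n)

    if norm == 'inf':
        return float(gaps.max())              # worst-bin gap, MCE-style
    return float((weights @ gaps ** norm) ** (1.0 / norm))
```

With the default arguments this sketch computes the same quantity as the ECE formula further down the page.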
- calibration_toolbox.metrics.expected_calibration_error(probabilities, labels, n_bins=15, logits=False)[source]
Calculate Expected Calibration Error (ECE).
ECE measures the difference between model confidence and accuracy across uniformly-spaced bins. It is defined as: ECE = Σ_b (n_b / N) |acc(b) - conf(b)|
- Reference:
Naeini et al. (2015). “Obtaining Well Calibrated Probabilities Using Bayesian Binning.” AAAI.
- Args:
- probabilities: Array of shape (n_samples, n_classes) containing predicted probabilities for each class.
- labels: Array of shape (n_samples,) containing true class labels.
- n_bins: Number of bins for confidence discretization. Default: 15.
- logits: If True, the input is logits and will be converted to probabilities. Default: False.
- Returns:
float: ECE value between 0 and 1 (lower is better).
- Example:
>>> probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
>>> labels = np.array([0, 1, 0])
>>> ece = expected_calibration_error(probs, labels)
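The formula above fits in a few lines of NumPy. The following is an illustrative reimplementation with right-inclusive uniform bins, not the toolbox's source; `ece_sketch` is a hypothetical name:

```python
import numpy as np

def ece_sketch(probabilities, labels, n_bins=15):
    """ECE = sum_b (n_b / N) |acc(b) - conf(b)| over uniform confidence bins."""
    probabilities = np.asarray(probabilities, dtype=float)
    labels = np.asarray(labels)
    confs = probabilities.max(axis=1)                    # top-1 confidence
    accs = (probabilities.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # right-inclusive bins; the first bin also includes confidence 0.0
        in_bin = (confs <= hi) & ((confs > lo) if i else (confs >= lo))
        if in_bin.any():
            ece += in_bin.mean() * abs(accs[in_bin].mean() - confs[in_bin].mean())
    return ece
```

A perfectly calibrated one-hot predictor has conf(b) = acc(b) in every occupied bin, so the sketch returns 0.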
- calibration_toolbox.metrics.maximum_calibration_error(probabilities, labels, n_bins=15, logits=False)[source]
Calculate Maximum Calibration Error (MCE).
MCE is the maximum calibration error across all bins: MCE = max_b |acc(b) - conf(b)|
- Reference:
Naeini et al. (2015). “Obtaining Well Calibrated Probabilities Using Bayesian Binning.” AAAI.
- Args:
- probabilities: Array of shape (n_samples, n_classes) containing predicted probabilities for each class.
- labels: Array of shape (n_samples,) containing true class labels.
- n_bins: Number of bins for confidence discretization. Default: 15.
- logits: If True, the input is logits and will be converted to probabilities. Default: False.
- Returns:
float: MCE value between 0 and 1 (lower is better).
- Example:
>>> probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
>>> labels = np.array([0, 1, 0])
>>> mce = maximum_calibration_error(probs, labels)
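Since MCE replaces the weighted sum of ECE with a maximum over bins, the sketch differs from the ECE one only in the accumulation step. Again an illustrative reimplementation, not the toolbox's source; `mce_sketch` is a hypothetical name:

```python
import numpy as np

def mce_sketch(probabilities, labels, n_bins=15):
    """MCE = max_b |acc(b) - conf(b)| over uniform confidence bins."""
    probabilities = np.asarray(probabilities, dtype=float)
    labels = np.asarray(labels)
    confs = probabilities.max(axis=1)
    accs = (probabilities.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    worst = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        in_bin = (confs <= hi) & ((confs > lo) if i else (confs >= lo))
        if in_bin.any():
            # track the worst per-bin gap instead of a weighted sum
            worst = max(worst, abs(accs[in_bin].mean() - confs[in_bin].mean()))
    return worst
```

Because it reports the single worst bin, MCE is never smaller than ECE on the same data and binning.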
- calibration_toolbox.metrics.root_mean_square_calibration_error(probabilities, labels, n_bins=15, logits=False)[source]
Calculate Root Mean Square Calibration Error (RMSCE).
RMSCE is the root mean square of calibration errors across bins: RMSCE = sqrt(Σ_b (n_b / N) (acc(b) - conf(b))^2)
- Reference:
Hendrycks et al. (2019). “Deep Anomaly Detection with Outlier Exposure.” ICLR.
- Args:
- probabilities: Array of shape (n_samples, n_classes) containing predicted probabilities for each class.
- labels: Array of shape (n_samples,) containing true class labels.
- n_bins: Number of bins for confidence discretization. Default: 15.
- logits: If True, the input is logits and will be converted to probabilities. Default: False.
- Returns:
float: RMSCE value between 0 and 1 (lower is better).
- Example:
>>> probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
>>> labels = np.array([0, 1, 0])
>>> rmsce = root_mean_square_calibration_error(probs, labels)
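The formula can be sketched by squaring the per-bin gaps before the weighted sum and taking a square root at the end. An illustrative reimplementation, not the toolbox's source; `rmsce_sketch` is a hypothetical name:

```python
import math
import numpy as np

def rmsce_sketch(probabilities, labels, n_bins=15):
    """RMSCE = sqrt(sum_b (n_b / N) (acc(b) - conf(b))^2) over uniform bins."""
    probabilities = np.asarray(probabilities, dtype=float)
    labels = np.asarray(labels)
    confs = probabilities.max(axis=1)
    accs = (probabilities.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        in_bin = (confs <= hi) & ((confs > lo) if i else (confs >= lo))
        if in_bin.any():
            # squared gap, weighted by the fraction of samples in the bin
            total += in_bin.mean() * (accs[in_bin].mean() - confs[in_bin].mean()) ** 2
    return math.sqrt(total)
```

The squaring makes RMSCE more sensitive than ECE to a few badly miscalibrated bins.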
- calibration_toolbox.metrics.static_calibration_error(probabilities, labels, n_bins=15, logits=False)[source]
Calculate Static Calibration Error (SCE).
SCE is the class-conditional calibration error with uniform binning, averaged across all classes: SCE = (1/K) Σ_k Σ_b (n_bk / N) |acc(b,k) - conf(b,k)|
- Reference:
Nixon et al. (2019). “Measuring Calibration in Deep Learning.” CVPR Workshops.
- Args:
- probabilities: Array of shape (n_samples, n_classes) containing predicted probabilities for each class.
- labels: Array of shape (n_samples,) containing true class labels.
- n_bins: Number of bins for confidence discretization. Default: 15.
- logits: If True, the input is logits and will be converted to probabilities. Default: False.
- Returns:
float: SCE value (lower is better).
- Example:
>>> probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
>>> labels = np.array([0, 1, 0])
>>> sce = static_calibration_error(probs, labels)
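In the class-conditional formula, every class's probability column is binned separately against the indicator of that class. A minimal NumPy sketch of this (illustrative only, not the toolbox's source; `sce_sketch` is a hypothetical name):

```python
import numpy as np

def sce_sketch(probabilities, labels, n_bins=15):
    """SCE = (1/K) sum_k sum_b (n_bk / N) |acc(b,k) - conf(b,k)|, uniform bins."""
    probabilities = np.asarray(probabilities, dtype=float)
    labels = np.asarray(labels)
    n, n_classes = probabilities.shape
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    sce = 0.0
    for c in range(n_classes):
        confs = probabilities[:, c]                  # this class's probability column
        accs = (labels == c).astype(float)           # indicator of the true class
        for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
            in_bin = (confs <= hi) & ((confs > lo) if i else (confs >= lo))
            if in_bin.any():
                sce += (in_bin.sum() / n) * abs(accs[in_bin].mean() - confs[in_bin].mean())
    return sce / n_classes
```

Unlike ECE, which only looks at the top prediction, this sums over all K columns and so also penalizes miscalibration of low-ranked class probabilities.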
- calibration_toolbox.metrics.adaptive_calibration_error(probabilities, labels, n_bins=15, logits=False)[source]
Calculate Adaptive Calibration Error (ACE).
ACE is the class-conditional calibration error with adaptive binning (equal number of samples per bin), averaged across all classes.
- Reference:
Nixon et al. (2019). “Measuring Calibration in Deep Learning.” CVPR Workshops.
- Args:
- probabilities: Array of shape (n_samples, n_classes) containing predicted probabilities for each class.
- labels: Array of shape (n_samples,) containing true class labels.
- n_bins: Number of bins for confidence discretization. Default: 15.
- logits: If True, the input is logits and will be converted to probabilities. Default: False.
- Returns:
float: ACE value (lower is better).
- Example:
>>> probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
>>> labels = np.array([0, 1, 0])
>>> ace = adaptive_calibration_error(probs, labels)
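One common realization of "equal number of samples per bin" is to sort each class's probabilities and split them into n_bins consecutive groups, then average the per-group gaps. This sketch follows that reading (the toolbox may differ in details such as tie handling); it is illustrative only, and `ace_sketch` is a hypothetical name:

```python
import numpy as np

def ace_sketch(probabilities, labels, n_bins=15):
    """ACE sketch: per class, sort by confidence and split into n_bins
    (near-)equal-count groups, then average the |acc - conf| gaps."""
    probabilities = np.asarray(probabilities, dtype=float)
    labels = np.asarray(labels)
    n, n_classes = probabilities.shape
    gaps = []
    for c in range(n_classes):
        order = np.argsort(probabilities[:, c])      # sort this class column
        confs = probabilities[order, c]
        accs = (labels[order] == c).astype(float)
        for idx in np.array_split(np.arange(n), n_bins):
            if idx.size:                             # skip empty splits when n < n_bins
                gaps.append(abs(accs[idx].mean() - confs[idx].mean()))
    return float(np.mean(gaps))
```

Because every bin holds (roughly) the same number of samples, sparsely populated confidence regions are not drowned out the way they can be with uniform bins.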
- calibration_toolbox.metrics.top_k_calibration_error(probabilities, labels, k=1, n_bins=15, logits=False)[source]
Calculate Top-k Calibration Error.
Computes calibration error for the top-k predicted classes, averaged across the k classes.
- Reference:
Gupta et al. (2021). “Calibration of Neural Networks using Splines.” ICLR.
- Args:
- probabilities: Array of shape (n_samples, n_classes) containing predicted probabilities for each class.
- labels: Array of shape (n_samples,) containing true class labels.
- k: Number of top classes to consider. Default: 1.
- n_bins: Number of bins for confidence discretization. Default: 15.
- logits: If True, the input is logits and will be converted to probabilities. Default: False.
- Returns:
float: Top-k calibration error (lower is better).
- Example:
>>> probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
>>> labels = np.array([0, 1, 0])
>>> top2_ce = top_k_calibration_error(probs, labels, k=2)
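One plausible reading of "averaged across the k classes" is to compute a binned calibration error separately for each prediction rank 1..k and average the k results. The sketch below follows that reading and is an assumption, not the toolbox's source; `top_k_ce_sketch` is a hypothetical name:

```python
import numpy as np

def top_k_ce_sketch(probabilities, labels, k=1, n_bins=15):
    """Binned calibration error per prediction rank 1..k, averaged over ranks."""
    probabilities = np.asarray(probabilities, dtype=float)
    labels = np.asarray(labels)
    ranked = np.argsort(-probabilities, axis=1)[:, :k]   # top-k class indices per sample
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    errors = []
    for rank in range(k):
        cls = ranked[:, rank]
        confs = probabilities[np.arange(len(labels)), cls]
        accs = (cls == labels).astype(float)
        err = 0.0
        for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
            in_bin = (confs <= hi) & ((confs > lo) if i else (confs >= lo))
            if in_bin.any():
                err += in_bin.mean() * abs(accs[in_bin].mean() - confs[in_bin].mean())
        errors.append(err)
    return float(np.mean(errors))
```

With k=1 this collapses to the ordinary ECE computation on the top prediction.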
- calibration_toolbox.metrics.thresholded_adaptive_calibration_error(probabilities, labels, threshold=0.01, n_bins=15, logits=False)[source]
Calculate Thresholded Adaptive Calibration Error (TACE).
TACE ignores predictions with confidence below a threshold before computing the adaptive calibration error.
- Reference:
Nixon et al. (2019). “Measuring Calibration in Deep Learning.” CVPR Workshops.
- Args:
- probabilities: Array of shape (n_samples, n_classes) containing predicted probabilities for each class.
- labels: Array of shape (n_samples,) containing true class labels.
- threshold: Confidence threshold. Predictions below this are ignored. Default: 0.01.
- n_bins: Number of bins for confidence discretization. Default: 15.
- logits: If True, the input is logits and will be converted to probabilities. Default: False.
- Returns:
float: TACE value (lower is better).
- Example:
>>> probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
>>> labels = np.array([0, 1, 0])
>>> tace = thresholded_adaptive_calibration_error(probs, labels, threshold=0.01)
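TACE is the adaptive, class-conditional computation with one extra step: per-class probabilities below the threshold are dropped before binning. The sketch below combines that filter with the equal-count binning reading used for ACE above; it is illustrative only, not the toolbox's source, and `tace_sketch` is a hypothetical name:

```python
import numpy as np

def tace_sketch(probabilities, labels, threshold=0.01, n_bins=15):
    """ACE-style adaptive binning per class, after dropping entries whose
    class probability falls below `threshold`."""
    probabilities = np.asarray(probabilities, dtype=float)
    labels = np.asarray(labels)
    n, n_classes = probabilities.shape
    gaps = []
    for c in range(n_classes):
        keep = probabilities[:, c] >= threshold      # the thresholding step
        confs = probabilities[keep, c]
        accs = (labels[keep] == c).astype(float)
        order = np.argsort(confs)
        confs, accs = confs[order], accs[order]
        for idx in np.array_split(np.arange(confs.size), n_bins):
            if idx.size:
                gaps.append(abs(accs[idx].mean() - confs[idx].mean()))
    return float(np.mean(gaps))
```

The threshold keeps the many near-zero probabilities a softmax assigns to unlikely classes from dominating the class-conditional average.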
- calibration_toolbox.metrics.overconfidence_error(probabilities, labels, n_bins=15, logits=False)[source]
Calculate Overconfidence Error (OE).
OE measures the degree of overconfidence, penalizing confident but incorrect predictions more heavily: OE = Σ_b (n_b / N) * conf(b) * max(conf(b) - acc(b), 0)
- Reference:
Thulasidasan et al. (2019). “On Mixup Training: Improved Calibration and Predictive Uncertainty for Deep Neural Networks.” NeurIPS.
- Args:
- probabilities: Array of shape (n_samples, n_classes) containing predicted probabilities for each class.
- labels: Array of shape (n_samples,) containing true class labels.
- n_bins: Number of bins for confidence discretization. Default: 15.
- logits: If True, the input is logits and will be converted to probabilities. Default: False.
- Returns:
float: OE value (lower is better).
- Example:
>>> probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
>>> labels = np.array([0, 1, 0])
>>> oe = overconfidence_error(probs, labels)
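The formula only accumulates bins where confidence exceeds accuracy, and the extra conf(b) factor weights confident mistakes more heavily. A minimal NumPy sketch (illustrative only, not the toolbox's source; `oe_sketch` is a hypothetical name):

```python
import numpy as np

def oe_sketch(probabilities, labels, n_bins=15):
    """OE = sum_b (n_b / N) * conf(b) * max(conf(b) - acc(b), 0), uniform bins."""
    probabilities = np.asarray(probabilities, dtype=float)
    labels = np.asarray(labels)
    confs = probabilities.max(axis=1)
    accs = (probabilities.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    oe = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        in_bin = (confs <= hi) & ((confs > lo) if i else (confs >= lo))
        if in_bin.any():
            c, a = confs[in_bin].mean(), accs[in_bin].mean()
            # underconfident bins (a > c) contribute nothing
            oe += in_bin.mean() * c * max(c - a, 0.0)
    return oe
```

A model that is systematically underconfident therefore scores 0 on OE even if its ECE is large.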
Aliases
The following short-form names are aliases of the functions documented above; each accepts the same arguments and returns the same value.
- calibration_toolbox.metrics.ECE(probabilities, labels, n_bins=15, logits=False): alias of expected_calibration_error().
- calibration_toolbox.metrics.MCE(probabilities, labels, n_bins=15, logits=False): alias of maximum_calibration_error().
- calibration_toolbox.metrics.RMSCE(probabilities, labels, n_bins=15, logits=False): alias of root_mean_square_calibration_error().
- calibration_toolbox.metrics.SCE(probabilities, labels, n_bins=15, logits=False): alias of static_calibration_error().
- calibration_toolbox.metrics.ACE(probabilities, labels, n_bins=15, logits=False): alias of adaptive_calibration_error().
- calibration_toolbox.metrics.TACE(probabilities, labels, threshold=0.01, n_bins=15, logits=False): alias of thresholded_adaptive_calibration_error().
- calibration_toolbox.metrics.OE(probabilities, labels, n_bins=15, logits=False): alias of overconfidence_error().
- calibration_toolbox.metrics.GCE(probabilities, labels, n_bins=15, class_conditional=False, adaptive_bins=False, top_k_classes=1, norm=1, thresholding=0.0, logits=False): alias of general_calibration_error().