Metrics Module

Calibration metrics for evaluating model uncertainty. This module provides a comprehensive collection of binning-based calibration metrics for classification models, centered around the General Calibration Error (GCE) framework.

calibration_toolbox.metrics.general_calibration_error(probabilities, labels, n_bins=15, class_conditional=False, adaptive_bins=False, top_k_classes=1, norm=1, thresholding=0.0, logits=False)[source]

Calculate General Calibration Error (GCE).

The GCE is a flexible calibration metric that can be configured to produce many popular calibration metrics including ECE, MCE, RMSCE, ACE, and SCE.

The class-conditional GCE with L^p norm is defined as: GCE = (Σ_k Σ_b (n_bk / (N·K)) |acc(b,k) - conf(b,k)|^p)^(1/p)

Where acc(b,k) and conf(b,k) are the accuracy and confidence of bin b for class label k; n_bk is the number of predictions in bin b for class k; N is the total number of data points; and K is the number of classes.
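As a rough illustration of the p-norm aggregation (restricted to top-1 predictions and uniform bins, so class-conditional, adaptive, and thresholding options are omitted; this is a sketch, not the toolbox's implementation):

```python
import numpy as np

def gce_sketch(probs, labels, n_bins=15, norm=1):
    """Top-1, uniform-bin GCE sketch: aggregate per-bin |acc - conf| gaps
    with an L^p norm. norm=1 recovers ECE, norm=2 RMSCE, norm='inf' MCE."""
    conf = probs.max(axis=1)                              # top-1 confidence
    correct = (probs.argmax(axis=1) == labels).astype(float)
    # Interior edges of n_bins uniform bins on [0, 1]
    bin_ids = np.digitize(conf, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    gaps, weights = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gaps.append(abs(correct[mask].mean() - conf[mask].mean()))
            weights.append(mask.mean())                   # n_b / N
    gaps, weights = np.array(gaps), np.array(weights)
    if norm == 'inf':
        return float(gaps.max())                          # unweighted max gap
    return float((weights * gaps ** norm).sum() ** (1.0 / norm))
```

With the three-sample example used throughout this page, norm=1, 2, and 'inf' give the weighted mean gap, its root-mean-square analogue, and the maximum gap, respectively.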

References:

Kull et al. (2019). “Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration.” NeurIPS.

Nixon et al. (2019). “Measuring Calibration in Deep Learning.” CVPR Workshops.

Args:
    probabilities: Array of shape (n_samples, n_classes) containing predicted probabilities for each class.
    labels: Array of shape (n_samples,) containing true class labels.
    n_bins: Number of bins for confidence discretization. Default: 15.
    class_conditional: If True, compute class-conditional calibration. Default: False.
    adaptive_bins: If True, use adaptive binning based on the data distribution. Default: False (uniform bins).
    top_k_classes: Number of top predicted classes to consider. Use 'all' to consider all classes. Default: 1 (top prediction only).
    norm: L^p norm to use. Can be 1, 2, or 'inf'. Default: 1.
    thresholding: Ignore probabilities below this threshold. Default: 0.0.
    logits: If True, input is logits and will be converted to probabilities. Default: False.
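When logits=True, inputs are mapped to probabilities via a softmax before binning. A minimal, numerically stable version of that conversion (an illustration of the standard technique, not necessarily the toolbox's exact code):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax: subtract the row max before exponentiating so
    large logits cannot overflow."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```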

Returns:

float: GCE value, typically between 0 and 1 (lower is better).

Example:
>>> probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
>>> labels = np.array([0, 1, 0])
>>> gce = general_calibration_error(probs, labels)
>>> print(f"GCE: {gce:.4f}")

calibration_toolbox.metrics.expected_calibration_error(probabilities, labels, n_bins=15, logits=False)[source]

Calculate Expected Calibration Error (ECE).

ECE measures the difference between model confidence and accuracy across uniformly-spaced bins. It is defined as: ECE = Σ_b (n_b / N) |acc(b) - conf(b)|

Reference:

Naeini et al. (2015). “Obtaining Well Calibrated Probabilities Using Bayesian Binning.” AAAI.

Args:
    probabilities: Array of shape (n_samples, n_classes) containing predicted probabilities for each class.
    labels: Array of shape (n_samples,) containing true class labels.
    n_bins: Number of bins for confidence discretization. Default: 15.
    logits: If True, input is logits and will be converted to probabilities. Default: False.

Returns:

float: ECE value between 0 and 1 (lower is better).

Example:
>>> probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
>>> labels = np.array([0, 1, 0])
>>> ece = expected_calibration_error(probs, labels)

calibration_toolbox.metrics.maximum_calibration_error(probabilities, labels, n_bins=15, logits=False)[source]

Calculate Maximum Calibration Error (MCE).

MCE is the maximum calibration error across all bins: MCE = max_b |acc(b) - conf(b)|

Reference:

Naeini et al. (2015). “Obtaining Well Calibrated Probabilities Using Bayesian Binning.” AAAI.

Args:
    probabilities: Array of shape (n_samples, n_classes) containing predicted probabilities for each class.
    labels: Array of shape (n_samples,) containing true class labels.
    n_bins: Number of bins for confidence discretization. Default: 15.
    logits: If True, input is logits and will be converted to probabilities. Default: False.

Returns:

float: MCE value between 0 and 1 (lower is better).

Example:
>>> probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
>>> labels = np.array([0, 1, 0])
>>> mce = maximum_calibration_error(probs, labels)

calibration_toolbox.metrics.root_mean_square_calibration_error(probabilities, labels, n_bins=15, logits=False)[source]

Calculate Root Mean Square Calibration Error (RMSCE).

RMSCE is the root mean square of calibration errors across bins: RMSCE = sqrt(Σ_b (n_b / N) (acc(b) - conf(b))^2)

Reference:

Hendrycks et al. (2019). “Deep Anomaly Detection with Outlier Exposure.” ICLR.

Args:
    probabilities: Array of shape (n_samples, n_classes) containing predicted probabilities for each class.
    labels: Array of shape (n_samples,) containing true class labels.
    n_bins: Number of bins for confidence discretization. Default: 15.
    logits: If True, input is logits and will be converted to probabilities. Default: False.

Returns:

float: RMSCE value between 0 and 1 (lower is better).

Example:
>>> probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
>>> labels = np.array([0, 1, 0])
>>> rmsce = root_mean_square_calibration_error(probs, labels)

calibration_toolbox.metrics.static_calibration_error(probabilities, labels, n_bins=15, logits=False)[source]

Calculate Static Calibration Error (SCE).

SCE is the class-conditional calibration error with uniform binning, averaged across all classes: SCE = (1/K) Σ_k Σ_b (n_bk / N) |acc(b,k) - conf(b,k)|
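The class-conditional computation differs from ECE in that every class's probability column is binned separately, not just the top prediction. A minimal sketch of this idea (uniform bins only; not the toolbox's implementation):

```python
import numpy as np

def sce_sketch(probs, labels, n_bins=15):
    """Class-conditional sketch: for each class, bin that class's predicted
    probability and compare the per-bin mean probability to the empirical
    frequency of the class; average the weighted gaps over all classes."""
    n, k = probs.shape
    edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    total = 0.0
    for c in range(k):
        conf = probs[:, c]                    # predicted probability of class c
        hit = (labels == c).astype(float)     # did class c actually occur?
        bin_ids = np.digitize(conf, edges)
        for b in range(n_bins):
            mask = bin_ids == b
            if mask.any():
                total += (mask.sum() / n) * abs(hit[mask].mean() - conf[mask].mean())
    return total / k
```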

Reference:

Nixon et al. (2019). “Measuring Calibration in Deep Learning.” CVPR Workshops.

Args:
    probabilities: Array of shape (n_samples, n_classes) containing predicted probabilities for each class.
    labels: Array of shape (n_samples,) containing true class labels.
    n_bins: Number of bins for confidence discretization. Default: 15.
    logits: If True, input is logits and will be converted to probabilities. Default: False.

Returns:

float: SCE value (lower is better).

Example:
>>> probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
>>> labels = np.array([0, 1, 0])
>>> sce = static_calibration_error(probs, labels)

calibration_toolbox.metrics.adaptive_calibration_error(probabilities, labels, n_bins=15, logits=False)[source]

Calculate Adaptive Calibration Error (ACE).

ACE is the class-conditional calibration error with adaptive binning (equal number of samples per bin), averaged across all classes.
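The adaptive ("equal-mass") binning step can be sketched by placing bin edges at quantiles of the observed confidences, so each bin holds roughly the same number of samples (an illustrative helper, not the toolbox's implementation):

```python
import numpy as np

def adaptive_bin_ids(conf, n_bins):
    """Equal-mass binning sketch: interior bin edges sit at quantiles of the
    confidence values, so bin populations are (roughly) balanced."""
    edges = np.quantile(conf, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    return np.digitize(conf, edges)
```

Unlike uniform binning, a cluster of high confidences is split across several bins instead of piling into one, which is what makes ACE less sensitive to sparsely populated confidence regions.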

Reference:

Nixon et al. (2019). “Measuring Calibration in Deep Learning.” CVPR Workshops.

Args:
    probabilities: Array of shape (n_samples, n_classes) containing predicted probabilities for each class.
    labels: Array of shape (n_samples,) containing true class labels.
    n_bins: Number of bins for confidence discretization. Default: 15.
    logits: If True, input is logits and will be converted to probabilities. Default: False.

Returns:

float: ACE value (lower is better).

Example:
>>> probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
>>> labels = np.array([0, 1, 0])
>>> ace = adaptive_calibration_error(probs, labels)

calibration_toolbox.metrics.top_k_calibration_error(probabilities, labels, k=1, n_bins=15, logits=False)[source]

Calculate Top-k Calibration Error.

Computes calibration error for the top-k predicted classes, averaged across the k classes.
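A sketch of the per-rank averaging (uniform bins; an illustration, not the toolbox's implementation): for each rank r < k, treat the r-th highest probability as the confidence that the r-th ranked class is correct, compute an ECE-style error per rank, and average.

```python
import numpy as np

def top_k_ce_sketch(probs, labels, k=2, n_bins=15):
    """Top-k sketch: one ECE-style pass per rank, averaged over the k ranks."""
    order = np.argsort(probs, axis=1)[:, ::-1]   # classes by descending probability
    edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    per_rank = []
    for r in range(k):
        conf = probs[np.arange(len(labels)), order[:, r]]
        hit = (order[:, r] == labels).astype(float)
        bin_ids = np.digitize(conf, edges)
        err = 0.0
        for b in range(n_bins):
            mask = bin_ids == b
            if mask.any():
                err += mask.mean() * abs(hit[mask].mean() - conf[mask].mean())
        per_rank.append(err)
    return float(np.mean(per_rank))
```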

Reference:

Gupta et al. (2021). “Calibration of Neural Networks using Splines.” ICLR.

Args:
    probabilities: Array of shape (n_samples, n_classes) containing predicted probabilities for each class.
    labels: Array of shape (n_samples,) containing true class labels.
    k: Number of top classes to consider. Default: 1.
    n_bins: Number of bins for confidence discretization. Default: 15.
    logits: If True, input is logits and will be converted to probabilities. Default: False.

Returns:

float: Top-k calibration error (lower is better).

Example:
>>> probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
>>> labels = np.array([0, 1, 0])
>>> top2_ce = top_k_calibration_error(probs, labels, k=2)

calibration_toolbox.metrics.thresholded_adaptive_calibration_error(probabilities, labels, threshold=0.01, n_bins=15, logits=False)[source]

Calculate Thresholded Adaptive Calibration Error (TACE).

TACE ignores predictions with confidence below a threshold before computing the adaptive calibration error.
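The thresholding step itself is simple to sketch: drop (probability, outcome) pairs whose probability falls below the threshold before any binning takes place. A hypothetical helper illustrating this filtering (not part of the toolbox API):

```python
import numpy as np

def threshold_filter(probs_flat, hits_flat, threshold=0.01):
    """Keep only predictions at or above the confidence threshold; the
    survivors are then binned as in adaptive_calibration_error."""
    keep = probs_flat >= threshold
    return probs_flat[keep], hits_flat[keep]
```

Thresholding matters mainly for class-conditional metrics, where most per-class probabilities are near zero and would otherwise dominate the low-confidence bins.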

Reference:

Nixon et al. (2019). “Measuring Calibration in Deep Learning.” CVPR Workshops.

Args:
    probabilities: Array of shape (n_samples, n_classes) containing predicted probabilities for each class.
    labels: Array of shape (n_samples,) containing true class labels.
    threshold: Confidence threshold. Predictions below this are ignored. Default: 0.01.
    n_bins: Number of bins for confidence discretization. Default: 15.
    logits: If True, input is logits and will be converted to probabilities. Default: False.

Returns:

float: TACE value (lower is better).

Example:
>>> probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
>>> labels = np.array([0, 1, 0])
>>> tace = thresholded_adaptive_calibration_error(probs, labels, threshold=0.01)

calibration_toolbox.metrics.overconfidence_error(probabilities, labels, n_bins=15, logits=False)[source]

Calculate Overconfidence Error (OE).

OE measures the degree of overconfidence, penalizing confident but incorrect predictions more heavily: OE = Σ_b (n_b / N) * conf(b) * max(conf(b) - acc(b), 0)
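The formula above can be sketched as an ECE variant in which each bin's gap is clipped at zero (only overconfidence counts) and additionally weighted by the bin's confidence (a sketch, not the toolbox's implementation):

```python
import numpy as np

def oe_sketch(probs, labels, n_bins=15):
    """OE sketch: per-bin contribution is (n_b / N) * conf(b) * max(conf(b) - acc(b), 0),
    so underconfident bins contribute nothing."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    bin_ids = np.digitize(conf, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    total = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            acc_b, conf_b = correct[mask].mean(), conf[mask].mean()
            total += mask.mean() * conf_b * max(conf_b - acc_b, 0.0)
    return total
```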

Reference:

Thulasidasan et al. (2019). “On Mixup Training: Improved Calibration and Predictive Uncertainty for Deep Neural Networks.” NeurIPS.

Args:
    probabilities: Array of shape (n_samples, n_classes) containing predicted probabilities for each class.
    labels: Array of shape (n_samples,) containing true class labels.
    n_bins: Number of bins for confidence discretization. Default: 15.
    logits: If True, input is logits and will be converted to probabilities. Default: False.

Returns:

float: OE value (lower is better).

Example:
>>> probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
>>> labels = np.array([0, 1, 0])
>>> oe = overconfidence_error(probs, labels)

Aliases

Each metric is also exported under its standard abbreviation. An alias accepts the same arguments and returns the same value as the function it names:

calibration_toolbox.metrics.ECE — alias of expected_calibration_error
calibration_toolbox.metrics.MCE — alias of maximum_calibration_error
calibration_toolbox.metrics.RMSCE — alias of root_mean_square_calibration_error
calibration_toolbox.metrics.SCE — alias of static_calibration_error
calibration_toolbox.metrics.ACE — alias of adaptive_calibration_error
calibration_toolbox.metrics.TACE — alias of thresholded_adaptive_calibration_error
calibration_toolbox.metrics.OE — alias of overconfidence_error
calibration_toolbox.metrics.GCE — alias of general_calibration_error
