Introduction

The most common statistical methods for comparing machine learning models and human readers are the p-value and the confidence interval. Although they have received some criticism recently, p-values and confidence intervals, if interpreted correctly, give more insight into results than a raw performance measure, and they are required by many journals.

This post shows example Python code that uses bootstrapping to compute confidence intervals and p-values for comparing machine learning models and human readers.

I will not discuss what a p-value does or does not mean, what the right threshold for statistical significance is, or how to interpret it properly; there are many resources that dive deeper into those topics.

Simple statistical toolset for machine learning

I published a GitHub repository, ml-stat-util, containing a set of simple functions written in Python for computing p-values and confidence intervals using bootstrapping. Below, I show how to use it in several common use cases.
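
For intuition, here is a minimal sketch of the percentile bootstrap that functions of this kind build on (my own illustration, not the library's actual code): cases are resampled with replacement, the score is recomputed on each resample, and the CI bounds are percentiles of the resulting score distribution.

import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_ci(y_true, y_pred, score_fun=roc_auc_score,
                 n_bootstraps=2000, alpha=0.95, seed=42):
    # y_true and y_pred are NumPy arrays of labels and predicted scores
    rng = np.random.RandomState(seed)
    n = len(y_true)
    scores = []
    while len(scores) < n_bootstraps:
        idx = rng.randint(0, n, n)  # resample cases with replacement
        if np.unique(y_true[idx]).size < 2:
            continue  # AUC is undefined when a resample has only one class
        scores.append(score_fun(y_true[idx], y_pred[idx]))
    # CI bounds are percentiles of the bootstrap score distribution
    ci_lower = np.percentile(scores, (1 - alpha) / 2 * 100)
    ci_upper = np.percentile(scores, (1 + alpha) / 2 * 100)
    return ci_lower, ci_upper, scores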

A Jupyter notebook with all the use cases described below is available on GitHub.

Use case #1

Compute the AUC with a 95% confidence interval for a single model.
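
The snippets below assume y_true (binary ground-truth labels) and y_pred (predicted scores) are NumPy arrays. For a self-contained run, you can generate hypothetical data like this:

import numpy as np

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, 200)  # hypothetical binary labels
# hypothetical scores loosely correlated with the labels
y_pred = np.clip(0.3 * y_true + rng.normal(0.5, 0.2, 200), 0.0, 1.0)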

from sklearn.metrics import roc_auc_score
import stat_util

# point estimate, 95% CI bounds, and the bootstrap score distribution
score, ci_lower, ci_upper, scores = stat_util.score_ci(
    y_true, y_pred, score_fun=roc_auc_score
)

To get an idea of what happened, we can plot a histogram of bootstrapped scores.

import numpy as np
import matplotlib.pyplot as plt

bins = plt.hist(scores)
plt.plot([score, score], [0, np.max(bins[0])], color="tomato")      # point estimate
plt.plot([ci_lower, ci_lower], [0, np.max(bins[0])], color="lime")  # CI lower bound
plt.plot([ci_upper, ci_upper], [0, np.max(bins[0])], color="lime")  # CI upper bound

[Figure: histogram of bootstrapped AUC scores with the point estimate and 95% CI bounds marked]

Use case #2

Compare two models by computing the p-value for the difference in their performance measured with AUC.

import numpy as np
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
import stat_util

# p is the p-value; z is the bootstrap distribution of AUC differences
p, z = stat_util.pvalue(y_true, y_pred1, y_pred2, score_fun=roc_auc_score)
bins = plt.hist(z)
plt.plot([0, 0], [0, np.max(bins[0])], color="black")  # zero-difference line

[Figure: histogram of bootstrapped AUC differences with the zero line marked]
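
Conceptually, a paired bootstrap test of this kind resamples cases with replacement, computes the difference in AUC between the two models on each resample, and derives the p-value from how often that difference crosses zero. A minimal sketch (my own illustration, not necessarily the library's exact implementation):

import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_pvalue(y_true, y_pred1, y_pred2, n_bootstraps=2000, seed=42):
    rng = np.random.RandomState(seed)
    n = len(y_true)
    z = []
    while len(z) < n_bootstraps:
        idx = rng.randint(0, n, n)  # resample cases with replacement
        if np.unique(y_true[idx]).size < 2:
            continue  # skip resamples with a single class
        z.append(roc_auc_score(y_true[idx], y_pred1[idx])
                 - roc_auc_score(y_true[idx], y_pred2[idx]))
    z = np.array(z)
    # two-sided p-value: twice the smaller tail of the difference distribution
    p = 2 * min((z <= 0).mean(), (z >= 0).mean())
    return min(p, 1.0), z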

Use case #3

Compute the mean AUC with a 95% confidence interval for a set of readers/models.
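
Here y_pred_readers is assumed to hold one array of predictions per reader (or per model). A hypothetical setup, reusing y_true from use case #1:

import numpy as np

rng = np.random.RandomState(1)
# hypothetical predictions from three readers, one score array per reader
y_pred_readers = [
    np.clip(0.3 * y_true + rng.normal(0.5, 0.2, len(y_true)), 0.0, 1.0)
    for _ in range(3)
]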

import numpy as np
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
import stat_util

# mean reader AUC, its 95% CI bounds, and the bootstrap distribution of the mean
mean_score, ci_lower, ci_upper, scores = stat_util.score_stat_ci(
    y_true, y_pred_readers, score_fun=roc_auc_score, stat_fun=np.mean
)
bins = plt.hist(scores)
plt.plot([mean_score, mean_score], [0, np.max(bins[0])], color="tomato")  # mean AUC
plt.plot([ci_lower, ci_lower], [0, np.max(bins[0])], color="lime")        # CI lower bound
plt.plot([ci_upper, ci_upper], [0, np.max(bins[0])], color="lime")        # CI upper bound

[Figure: histogram of the bootstrapped mean AUC with the mean and 95% CI bounds marked]

Use case #4

Compare a single model to a set of readers by computing the p-value for the difference between the model's AUC and the readers' mean AUC.

import numpy as np
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
import stat_util

# p is the p-value; z is the bootstrap distribution of differences
# between the model's AUC and the mean reader AUC
p, z = stat_util.pvalue_stat(
    y_true, y_pred, y_pred_readers, score_fun=roc_auc_score, stat_fun=np.mean
)
bins = plt.hist(z)
plt.plot([0, 0], [0, np.max(bins[0])], color="black")  # zero-difference line

[Figure: histogram of bootstrapped differences between the model's AUC and the mean reader AUC, with the zero line marked]
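
As in use case #2, the black line marks zero: the p-value reflects how much of the bootstrap distribution of differences, here between the model's AUC and the mean reader AUC, falls on the other side of it.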