On the appropriateness of Platt scaling in classifier calibration
Many applications of data mining and machine learning require posterior probability estimates in addition to often highly accurate predictions. Classifier calibration is a branch of machine learning that aims at transforming classifier predictions into posterior class probabilities, which serve as a useful extension in the respective applications. Among the existing state-of-the-art classifier calibration techniques, Platt scaling (sometimes also called sigmoid or logistic calibration) is the only parametric one, while almost all of its competitors do not rely on parametric assumptions. Platt scaling is discussed controversially in the classifier calibration literature: despite good empirical results reported in many domains, many authors criticize it. Interestingly, none of these criticisms properly addresses the underlying parametric assumptions; some statements are even incorrect. The first contribution of this work is therefore to review this criticism and to present a proof of the true parametric assumptions, which, as an immediate consequence, turn out to be more general and valid for different probability distributions. Next, the relationship between Platt scaling and beta calibration, a different and relatively new classifier calibration technique, is analyzed, and the two are shown to be equivalent: their only difference lies in the characteristics of the classifier whose predictions are calibrated. The proven validity of Platt scaling thus translates directly into a proven optimality of beta calibration. Furthermore, evaluating classifier calibration techniques is a highly non-trivial problem because the true posteriors cannot be used as a reference. Hence, the existing evaluation metrics are reviewed as well, since some relatively popular evaluation criteria should not be used at all. Finally, the theoretical findings are supported by a simulation study.
Keywords: Platt scaling, Classifier calibration, Posterior probability, Logistic regression, Proper scoring rules
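As a concrete illustration of the technique under discussion (a minimal sketch on hypothetical toy data, not the paper's experimental setup), Platt scaling fits the two parameters of the sigmoid P(y=1 | s) = 1 / (1 + exp(a*s + b)) to held-out classifier scores by minimizing the negative log-likelihood. The plain gradient-descent fit below stands in for the more refined optimizers used in practice:

```python
# Minimal sketch of Platt scaling: fit p(y=1|s) = 1 / (1 + exp(a*s + b))
# on held-out (score, label) pairs by gradient descent on the log loss.
# Toy data and hyperparameters are illustrative assumptions.
import math


def platt_fit(scores, labels, lr=0.01, steps=5000):
    """Fit the two sigmoid parameters (a, b) of Platt scaling."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(a * s + b))
            # gradient of the log loss w.r.t. a and b
            grad_a += (p - y) * (-s)
            grad_b += (p - y) * (-1.0)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def platt_predict(a, b, s):
    """Map a raw classifier score s to a calibrated posterior estimate."""
    return 1.0 / (1.0 + math.exp(a * s + b))


# Toy calibration set: higher scores indicate the positive class.
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 0, 1, 1, 1]
a, b = platt_fit(scores, labels)
```

In practice the calibration set must be disjoint from the classifier's training data, since scores on training examples are typically overconfident.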