A Babel Tower of Binary Classification

Mark Shovman
COO | Eyeviation

“Binary classification is the task of classifying the elements of a given set into two groups (predicting which group each one belongs to)” (Wikipedia [1])

A common visualisation of a binary classification task is a 2-by-2 table

Consider a pregnancy test. A person can be either pregnant or not; and a pregnancy test will either tell that the person is pregnant or that they are not. The test results are usually correct, but there might be errors.

Or consider a hurricane alert. It should come before a hurricane, but sometimes there are errors. A false alarm is bad, but a hurricane that comes without a warning is worse. Or a search for alien life. Or quality assurance protocols…

All these are binary classification tasks. They occur in many fields — medicine, psychology, machine learning, statistics, quality assurance, weather forecasting, and so on. There is even a whole field devoted to them: Signal Detection Theory.

Unfortunately, each field seems to have its own terminology around binary classification. For instance, the kind of error where a test shows that a person is pregnant when they are not is called a ‘False Positive’ in medicine, a ‘False Alarm’ in signal detection, and a ‘Type I Error’ in statistics. Very confusing.

I often work in interdisciplinary teams, and time and again I find myself translating a term for the same thing from, say, machine-learning-ese to medical-ese. At some point, I compiled a short glossary, and it has helped me a lot over the years. Here it is; I hope it helps you too.

Part I — Basics

A common way to summarise the outcomes of a binary classification task is a 2-by-2 table, called, alternatively, an error matrix, a 2x2 contingency table, or, fittingly, a confusion matrix.

A 2-by-2 table of binary classification
A glossary for basic elements of a confusion matrix
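To make the four cells concrete, here is a minimal sketch in plain Python that counts them from parallel lists of actual states and test results (the function name and the example data are my own, purely for illustration):

```python
def confusion_matrix(actual, predicted):
    """Count the four cells of a 2-by-2 confusion matrix.

    `actual` and `predicted` are parallel sequences of booleans:
    True means 'condition present' / 'test says positive'.
    """
    tp = sum(a and p for a, p in zip(actual, predicted))          # hit
    fn = sum(a and not p for a, p in zip(actual, predicted))      # miss
    fp = sum(p and not a for a, p in zip(actual, predicted))      # false alarm
    tn = sum(not a and not p for a, p in zip(actual, predicted))  # correct rejection
    return tp, fp, fn, tn

# Hypothetical results of five pregnancy tests:
actual    = [True, True, False, False, True]   # actually pregnant?
predicted = [True, False, False, True, True]   # test said pregnant?
print(confusion_matrix(actual, predicted))     # (2, 1, 1, 1)
```

The same four counts, however you arrange them in the table, are the raw material for everything in Parts II and III.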

Part II — Conditional Probabilities

The second bundle of terminological confusion is ‘conditional probabilities’ — secondary calculations from the numbers in the confusion matrix. For instance, the probability that a pregnancy test says someone is pregnant given that they actually are (‘actually pregnant’ is the condition here). For a 2-by-2 case, there are eight conditional probabilities in total, and the terminology runs wild across the fields. This specific conditional probability (TP / (TP+FN)) is called, alternatively, ‘True Positive Rate’, ‘Sensitivity’, ‘Recall’, ‘Hit Rate’, and ‘Test Power’. If I missed any, please tell me and I’ll add them.

A glossary for conditional probabilities
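As a sketch of the bookkeeping, here are all eight conditional probabilities computed from the four counts (plain Python; I use one common abbreviation per rate as the key, with some of the cross-field synonyms in comments):

```python
def conditional_probabilities(tp, fp, fn, tn):
    """The eight conditional probabilities of a 2-by-2 confusion matrix.

    Four condition on the actual state (rows of the matrix),
    four condition on the test result (columns).
    """
    return {
        "TPR": tp / (tp + fn),  # sensitivity, recall, hit rate, test power
        "FNR": fn / (tp + fn),  # miss rate, Type II error rate
        "FPR": fp / (fp + tn),  # fall-out, false alarm rate, Type I error rate
        "TNR": tn / (fp + tn),  # specificity
        "PPV": tp / (tp + fp),  # precision, positive predictive value
        "FDR": fp / (tp + fp),  # false discovery rate
        "NPV": tn / (tn + fn),  # negative predictive value
        "FOR": fn / (tn + fn),  # false omission rate
    }

probs = conditional_probabilities(tp=45, fp=5, fn=15, tn=35)
print(round(probs["TPR"], 2))  # 0.75
```

Note that each pair (TPR/FNR, FPR/TNR, PPV/FDR, NPV/FOR) sums to 1 — that is why knowing any one of a pair gives you the other.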

Part III — Evaluation Metrics

The last place where this glossary often comes in useful is evaluating the quality of a given binary classifier. Given two pregnancy tests, or several hurricane alert systems, how do we compare their quality to choose the better one? Each field seems to invent its own metric — and sometimes they are the same, or very similar, but under different names.

Sometimes, of course, different fields really do need different metrics. For instance, in weather forecasting there is no way to count the True Negatives (TN) — that would be every time there was neither a hurricane alert nor a hurricane, and how does one count that? For these cases, the evaluation metrics have to be TN-independent — see below. If you want to read more, there’s an excellent (but longish) Wiki article [2]

A list of evaluation metrics by field
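As an illustration of the TN-independence point above, here is a sketch of a few common metrics in plain Python. Precision, recall, and F1 use only TP, FP, and FN, so they work for the hurricane-alert case; accuracy needs TN, so it is only computed when TN is available (the function and example numbers are mine, for illustration only):

```python
def evaluation_metrics(tp, fp, fn, tn=None):
    """A few common evaluation metrics; pass tn=None when it cannot be counted."""
    precision = tp / (tp + fp)  # TN-independent
    recall = tp / (tp + fn)     # TN-independent
    metrics = {
        "precision": precision,
        "recall": recall,
        # Harmonic mean of precision and recall; also TN-independent.
        "F1": 2 * precision * recall / (precision + recall),
    }
    if tn is not None:
        # Accuracy needs all four cells, so it is unusable when TN is uncountable.
        metrics["accuracy"] = (tp + tn) / (tp + fp + fn + tn)
    return metrics

# Hurricane alerts: TN ('no alert and no hurricane') cannot be counted,
# so we evaluate with the TN-independent metrics only.
print(evaluation_metrics(tp=8, fp=4, fn=2))
```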

Summary

These are basically my cuff-notes, not very organised — I just hope they help someone make sense of new terminology in an unfamiliar field. If you have new fields or terms to add, or just want to share your experience — please comment.

Links

  1. https://en.wikipedia.org/wiki/Statistical_classification
  2. https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
  3. https://en.wikipedia.org/wiki/Confusion_matrix
  4. https://en.wikipedia.org/wiki/Sensitivity_and_specificity
  5. https://en.wikipedia.org/wiki/Precision_and_recall
  6. https://en.wikipedia.org/wiki/Receiver_operating_characteristic